Cluster configuration

Cluster configuration and optional features

Your cluster can be configured in many ways. Most options have sensible defaults, which can be overridden on request.

A default cluster comes with three control plane nodes and three worker nodes. To connect the nodes we create a network, by default 10.128.0.0/22. We also deploy monitoring to verify that all cluster components are working. These are only defaults and can be overridden.

Common options

Nodes

The standard configuration consists of the following:

  • Three control plane nodes, one in each of our availability zones. Flavor: v2-c2-m8-d80
  • Three worker nodes, one in each of our availability zones, in a single nodegroup. Flavor: v2-c2-m8-d80

Minimal configuration

  • Three control plane nodes, one in each of our availability zones. Flavor: v2-c2-m8-d80

  • One worker node, Flavor: v2-c2-m8-d80

    This is the minimal configuration offered. Scaling to larger flavors and adding nodes is supported. Autoscaling is not supported with a single worker node.

    Note: The SLA differs for clusters with the minimal configuration. SLAs can be found here.

Nodegroups and multiple flavors

A nodegroup consists of one or more nodes with the same flavor, plus a list of availability zones to deploy the nodes in. By default, clusters are delivered with a single nodegroup containing 3 nodes, one in each AZ. Each nodegroup is limited to one flavor.

You can have multiple nodegroups, for example if you want to place certain workloads on separate nodes or if you wish to use multiple flavors.

A few examples of nodegroups:

Name | Flavor | AZ list | Min node count | Max node count (autoscaling)
worker | v2-c2-m8-d80 | STO1, STO2, STO3 | 3 | 0
database | d2-c8-m120-d1.6k | STO1, STO2, STO3 | 3 | 0
frontend | v2-c4-m16-d160 | STO1, STO2, STO3 | 3 | 12
jobs | v2-c4-m16-d160 | STO1 | 1 | 3

The examples show worker, our default nodegroup, alongside separate nodegroups for databases and frontend. The database runs on dedicated nodes, while the frontend runs on smaller nodes and can autoscale between 3 and 12 nodes based on current resource requests in the cluster. The jobs nodegroup has one node in STO1 and can scale up to 3 nodes, all placed in STO1. You can read more about autoscaling here.
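To make the nodegroup examples concrete, here is a minimal Python sketch that models the rows of the table and sums up how small or large the cluster can get. The Nodegroup class and its field names are hypothetical and are not part of any API we provide. The sketch also assumes that a max node count of 0 means autoscaling is disabled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nodegroup:
    """Hypothetical model of one row in the nodegroup table."""
    name: str
    flavor: str
    zones: tuple[str, ...]
    min_nodes: int
    max_nodes: int  # assumption: 0 means autoscaling is disabled

    @property
    def autoscaling(self) -> bool:
        return self.max_nodes > self.min_nodes

    def node_range(self) -> tuple[int, int]:
        """Smallest and largest node count this group can reach."""
        upper = self.max_nodes if self.autoscaling else self.min_nodes
        return self.min_nodes, upper


ALL_ZONES = ("STO1", "STO2", "STO3")

NODEGROUPS = [
    Nodegroup("worker", "v2-c2-m8-d80", ALL_ZONES, 3, 0),
    Nodegroup("database", "d2-c8-m120-d1.6k", ALL_ZONES, 3, 0),
    Nodegroup("frontend", "v2-c4-m16-d160", ALL_ZONES, 3, 12),
    Nodegroup("jobs", "v2-c4-m16-d160", ("STO1",), 1, 3),
]


def cluster_node_range(groups: list[Nodegroup]) -> tuple[int, int]:
    """Sum the per-group minimum and maximum node counts."""
    lows, highs = zip(*(g.node_range() for g in groups))
    return sum(lows), sum(highs)


print(cluster_node_range(NODEGROUPS))
```

With the four example nodegroups, the cluster would run between 10 and 21 worker nodes in total.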

Nodegroups can be changed at any time. Please also note that we have auto-healing, meaning that if any of your nodes stops working for any reason, we will replace it. More about auto-healing can be found here.

Worker nodes Floating IPs

By default, the nodes in our clusters have no Floating IPs attached. If you require Floating IPs on your worker nodes, please inform us and we will configure your cluster accordingly. The most common use case for Floating IPs is to ensure predictable source IPs. Note that enabling or disabling Floating IPs requires all your nodes to be recreated.

During upgrades we create a new node before removing an old one, so one additional IP address must be kept on standby. For example, for 3 worker nodes with autoscaling up to 5 nodes, we allocate 6 IPs.
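The arithmetic behind the allocation can be sketched in a few lines of Python. The function name and the surge_nodes parameter are hypothetical. The sketch assumes exactly one extra node exists at a time during an upgrade, as described above.

```python
def floating_ips_needed(max_worker_nodes: int, surge_nodes: int = 1) -> int:
    """Floating IPs to allocate: one per node at maximum scale,
    plus spare IPs for the node(s) created during an upgrade."""
    if max_worker_nodes < 1:
        raise ValueError("a cluster needs at least one worker node")
    return max_worker_nodes + surge_nodes


# 3 worker nodes that can autoscale up to 5 nodes
print(floating_ips_needed(5))
```

The count is based on the autoscaling maximum rather than the current node count, since an upgrade can take place while the nodegroup is fully scaled out.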

Network

By default we create a node network (10.128.0.0/22), but we can use another subnet on request. The most common reason customers request another subnet is to expose multiple Kubernetes clusters over a VPN.

Please make sure to inform us if you wish to use a custom subnet during the ordering process since we cannot replace the network after creation, meaning we would then need to recreate your entire cluster.

We currently only support CIDRs within the 10.0.0.0/8 range that are at least a /24 in size. Both nodes and load balancers use IPs from this range, so the network needs to be sizable from the beginning.
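If you want to check a candidate subnet before ordering, the constraints can be sketched with Python's standard ipaddress module. The function name is hypothetical, and the sketch assumes that "at least a /24" means a prefix length of 24 or shorter (256 addresses or more).

```python
import ipaddress

SUPPORTED_RANGE = ipaddress.ip_network("10.0.0.0/8")
MAX_PREFIX_LENGTH = 24  # assumption: /24 or larger networks only


def is_supported_node_network(cidr: str) -> bool:
    """Return True if cidr lies inside 10.0.0.0/8 and is at least a /24."""
    try:
        # strict=True rejects CIDRs with host bits set, e.g. 10.128.0.1/22
        network = ipaddress.ip_network(cidr, strict=True)
    except ValueError:
        return False
    if network.version != 4:
        return False
    return (
        network.subnet_of(SUPPORTED_RANGE)
        and network.prefixlen <= MAX_PREFIX_LENGTH
    )


for candidate in ("10.128.0.0/22", "10.50.0.0/25", "192.168.0.0/24"):
    print(candidate, is_supported_node_network(candidate))
```

The default network, 10.128.0.0/22, passes the check and holds 1024 addresses, which are shared between nodes and load balancers.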

Cluster domain

We default all clusters to “cluster.local”. If you wish to have another cluster domain please let us know during the ordering procedure since it cannot be replaced after cluster creation.

OIDC

If you wish to integrate with an existing OIDC-compatible IdP, for example Microsoft AD or Google Workspace, this is supported directly in the Kubernetes API server.

By default we ship clusters with this option disabled. If you wish to use OIDC, just let us know when you order the cluster or afterwards. OIDC can be enabled, disabled or changed at any time.

Kubelet configurations and resource reservations

We make a few adaptations to the vanilla Kubernetes settings.

  • NodeDrainVolume and NodeDrainTimeout: 5 → 15 min

    • Increased duration to 15 minutes to allow more time for graceful shutdown and controlled startup of workload on new nodes, while respecting PodDisruptionBudgets.
  • podPidsLimit: 0 → 4096

    • Adds a safety net: a per-pod maximum number of PIDs (process IDs), enforced by the kubelet. Previously there was no limit. Setting this to 4096 caps how many PIDs a single pod may create, which helps mitigate runaway processes and fork bombs.
  • serializeImagePulls: true → false

    • Allows the kubelet to pull multiple images in parallel, speeding up startup times.
  • maxParallelImagePulls: 0 → 10

    • Controls the maximum number of image pulls the kubelet will perform in parallel.
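As an illustration, the three kubelet settings above can be collected in a small Python sketch. The dictionary layout and the check function are hypothetical and do not show how we actually apply the settings. The consistency rule in the check (parallel pulls above 1 require serializeImagePulls to be false) is an assumption based on the upstream KubeletConfiguration documentation.

```python
import json

# Values taken from the list above; the dict itself is only illustrative.
KUBELET_OVERRIDES = {
    "podPidsLimit": 4096,
    "serializeImagePulls": False,
    "maxParallelImagePulls": 10,
}


def check_image_pulls(config: dict) -> None:
    """Raise ValueError if parallel pulls are requested while pulls are serialized."""
    serialize = config.get("serializeImagePulls", True)
    parallel = config.get("maxParallelImagePulls")
    if serialize and parallel is not None and parallel > 1:
        raise ValueError(
            "maxParallelImagePulls > 1 requires serializeImagePulls: false"
        )


check_image_pulls(KUBELET_OVERRIDES)  # raises nothing for the values above
print(json.dumps(KUBELET_OVERRIDES, indent=2))
```

The two image pull settings work together: disabling serialized pulls allows parallelism, and maxParallelImagePulls caps it.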

Resource reservations on worker nodes

To improve stability and predictability of the core Kubernetes functionality during heavy load, we introduce node reservations for CPU, memory, and ephemeral storage.

The reservation model follows proven hyperscaler formulas but is tuned conservatively, ensuring more allocatable resources.

Hyperscalers tend not to distinguish between systemReserved and kubeReserved, and bundle all reservations into kubeReserved. We use both, skewed towards kube reservations to align more closely with hyperscalers while still maintaining reservations for the system. We calculate the reservation settings dynamically from the CPU cores, memory and storage of each flavor.

Here we’ve provided a sample of what to expect:

CPU Reservations

Cores (int) | System reserved (millicores) | Kube reserved (millicores) | Allocatable of node (%)
2 | 35 | 120 | 92%
4 | 41 | 180 | 94%
8 | 81 | 240 | 96%
16 | 83 | 320 | 97%
32 | 88 | 480 | 98%
64 | 98 | 800 | 99%

Memory Reservations

Memory (Gi) | System reserved (Gi) | Kube reserved (Gi) | Reserved total (Gi) | Eviction Soft (Gi) | Eviction Hard (Gi) | Allocatable of node (%)
8 | 0.4 | 1.0 | 1.4 | 0.00 | 0.25 | 79%
16 | 0.4 | 1.8 | 2.2 | 0.00 | 0.25 | 85%
32 | 0.4 | 3.4 | 3.8 | 0.00 | 0.25 | 87%
64 | 0.4 | 3.7 | 4.1 | 0.00 | 0.25 | 93%
120 | 0.4 | 4.3 | 4.7 | 0.00 | 0.25 | 96%
240 | 0.4 | 4.5 | 4.9 | 0.00 | 0.25 | 98%
384 | 0.4 | 6.9 | 7.3 | 0.00 | 0.25 | 98%
512 | 0.4 | 8.2 | 8.6 | 0.00 | 0.25 | 98%

Ephemeral Disk Reservations

NOTE: We use the default hard eviction threshold of 10% for nodefs.available.

Storage (Gi) | System reserved (Gi) | Kube reserved (Gi) | Reserved total (Gi) | Eviction Soft (Gi) | Eviction Hard (Gi) | Allocatable of node (%)
60 | 12.0 | 1.0 | 13.0 | 0.0 | 6.0 | 68%
80 | 12.0 | 1.0 | 13.0 | 0.0 | 8.0 | 74%
120 | 12.0 | 1.0 | 13.0 | 0.0 | 12.0 | 79%
240 | 12.0 | 1.0 | 13.0 | 0.0 | 24.0 | 85%
1600 | 12.0 | 1.0 | 13.0 | 0.0 | 160.0 | 89%
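The allocatable percentages in the tables above are consistent with the usual Kubernetes relationship: allocatable equals capacity minus system reserved, kube reserved and the hard eviction threshold. The Python sketch below reproduces a few sample rows. The function name is hypothetical, and rounding to the nearest whole percent is an assumption about how the tables were produced. The formulas that derive the reservations themselves from each flavor are not shown here.

```python
def allocatable_percent(
    capacity: float,
    system_reserved: float,
    kube_reserved: float,
    eviction_hard: float = 0.0,
) -> int:
    """Share of a node's capacity left for pods, as a whole percentage.

    All arguments must use the same unit (millicores for CPU, Gi for
    memory and storage).
    """
    allocatable = capacity - system_reserved - kube_reserved - eviction_hard
    return round(100 * allocatable / capacity)


# CPU: 2 cores = 2000 millicores, no eviction threshold applies
print(allocatable_percent(2000, 35, 120))

# Memory: 8 Gi node with a 0.25 Gi hard eviction threshold
print(allocatable_percent(8, 0.4, 1.0, 0.25))

# Ephemeral storage: 80 Gi disk, hard eviction at 10% of capacity
print(allocatable_percent(80, 12.0, 1.0, 0.10 * 80))
```

Because the system and kube reservations grow slowly with flavor size, larger flavors leave a higher share of the node allocatable.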

Cluster add-ons

We currently offer managed cert-manager, NGINX Ingress and elx-nodegroup-controller.

Cert-manager

Cert-manager (cert-manager.io) helps you manage TLS certificates. A common use case is to use Let’s Encrypt to “automatically” generate certificates for web apps, but the functionality goes much deeper. We also have usage instructions, as well as a guide if you wish to deploy cert-manager yourself.

Ingress

An ingress controller in a Kubernetes cluster manages how external traffic reaches your services. It routes requests based on rules, handles load balancing, and can integrate with cert-manager to manage TLS certificates. This simplifies traffic handling and improves scalability and security compared to exposing each service individually. We have a usage guide with examples that can be found here.

We have chosen to use ingress-nginx. To be able to support ingress, we limit which custom configurations can be made per cluster. We offer two “modes”. Direct mode is the default and is used when end clients connect directly to your ingress. Proxy mode is for when a proxy (e.g., a WAF) is used in front of your ingress. In proxy mode we can also restrict incoming traffic to specific source IP addresses, which we recommend for security reasons. If you are unsure which mode to use or how to handle IP whitelisting, just let us know and we will help you choose the best options for your use case.

If you are interested in removing any limitations, we’ve assembled guides with everything you need to install the same IngressController as we provide. This will give you full control. The various resources give configuration examples and instructions for lifecycle management. These can be found here.

elx-nodegroup-controller

The nodegroup controller is useful when customers want to use custom taints or labels on their nodes. It supports matching nodes by nodegroup or by name. The controller can be found on GitHub if you wish to inspect the code or deploy it yourself.