
Today, Wednesday, the next Kubernetes release, 1.16, is due. Following the tradition established on our blog, this is the tenth (anniversary!) time we cover the most significant changes in the new version.
The information used to prepare this material is taken from the Kubernetes enhancements tracking table, CHANGELOG-1.16, related issues and pull requests, as well as Kubernetes Enhancement Proposals (KEP). So let's go!
Nodes
A remarkably large number of notable innovations (in alpha status) appear on the node side of K8s clusters (kubelet).
First up are the so-called "ephemeral containers" (Ephemeral Containers), designed to simplify debugging in pods. The new mechanism makes it possible to launch special containers that start in the namespace of an existing pod and live for a short time. Their purpose is to interact with the other pods and containers in order to troubleshoot and debug problems. A new kubectl debug command is implemented for this feature, similar in essence to kubectl exec: instead of starting a process in a container (as exec does), it starts a container in a pod. For example, the following command attaches a new container to a pod:
kubectl debug -c debug-shell --image=debian target-pod -- bash
Details on ephemeral containers (and examples of their use) can be found in the corresponding KEP. The current implementation (in K8s 1.16) is an alpha version, and one of the criteria for promoting it to beta is "testing the Ephemeral Containers API for at least 2 [Kubernetes] releases".
NB: in both essence and even name, this feature resembles the already existing kubectl-debug plugin, which we have written about before. It is expected that with the arrival of ephemeral containers, development of the separate external plugin will stop.
Another innovation, PodOverhead, is designed to provide a mechanism for accounting for the overhead of pods, which can vary greatly depending on the runtime used. As an example, the authors of this KEP cite Kata Containers, which require running a guest kernel, the kata agent, an init system, and so on. When the overhead becomes that large, it can no longer be ignored, which means a way is needed to take it into account for further quotas, scheduling, and so on. To implement this, an Overhead *ResourceList field has been added to PodSpec (it is mapped to data in the RuntimeClass, if one is used).
Another notable innovation is the Node Topology Manager, designed to unify the approach to fine-tuning the allocation of hardware resources for various components in Kubernetes. This initiative is driven by the growing demand of modern systems (in telecommunications, machine learning, financial services, and so on) for high-performance parallel computing and for minimizing latency, for which they use advanced CPU and hardware-acceleration capabilities. Until now, such optimizations in Kubernetes have been achieved through disparate components (CPU Manager, Device Manager, CNI); now a single internal interface is being added that unifies the approach and simplifies plugging in new similar, so-called topology-aware, components on the kubelet side. Details are in the corresponding KEP.
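A minimal sketch of enabling the alpha feature through the kubelet configuration file; the exact spelling of the policy field in the 1.16 alpha config is our assumption (the same setting is also exposed as the --topology-manager-policy kubelet flag), and best-effort is one of the supported policy values:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  TopologyManager: true              # alpha feature gate in 1.16
topologyManagerPolicy: best-effort   # other policies: none, restricted, single-numa-node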
Topology Manager component diagram
The next feature is a check on containers during startup (startup probe). As you know, for containers that take a long time to start it is difficult to get an up-to-date status: they are either "killed" before they actually begin operating, or they sit in a deadlock for a long time. The new check (enabled through the feature gate called StartupProbeEnabled) cancels, or rather postpones, the effect of all other checks until the moment the pod has finished starting up. For that reason, the feature was originally called pod-startup liveness-probe holdoff. For pods with a long startup, this makes it possible to poll their state at relatively short time intervals.
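A minimal sketch of a pod using the new probe (the name, image, and endpoint are hypothetical); with the settings below the container gets up to 300 seconds to finish starting before the liveness probe takes over:
apiVersion: v1
kind: Pod
metadata:
  name: slow-start-app                          # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/slow-app:1.0    # hypothetical image
    startupProbe:                               # alpha in 1.16, behind the feature gate
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30                      # up to 30 * 10s = 300s to finish starting
      periodSeconds: 10
    livenessProbe:                              # kicks in only after the startup probe succeeds
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5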
In addition, an improvement for RuntimeClass lands immediately in beta status, adding support for "heterogeneous clusters". With RuntimeClass Scheduling it is no longer necessary for every node to support every RuntimeClass: a RuntimeClass can be chosen for pods without thinking about the cluster topology. Previously, to achieve this, so that pods would end up on nodes supporting everything they need, appropriate rules had to be assigned via NodeSelector and tolerations. The KEP describes usage examples and, of course, implementation details.
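A minimal sketch, assuming a hypothetical RuntimeClass for nodes running a sandboxed runtime; the nodeSelector and tolerations declared here are merged into pods that reference this RuntimeClass, so the scheduler only places them on suitable nodes:
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: gvisor                       # hypothetical name
handler: runsc                       # hypothetical handler
scheduling:                          # RuntimeClass Scheduling (beta in 1.16)
  nodeSelector:
    runtime: gvisor                  # hypothetical node label
  tolerations:
  - key: runtime
    operator: Equal
    value: gvisor
    effect: NoSchedule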
Network
Two significant network features that first appeared (in the alpha version) in Kubernetes 1.16 are:
- Support for a dual network stack, IPv4/IPv6, and its corresponding "understanding" at the level of pods, nodes, and services. It includes IPv4-to-IPv4 and IPv6-to-IPv6 communication between pods and from pods to external services, reference implementations (in the Bridge CNI, PTP CNI, and Host-Local IPAM plugins), as well as backward compatibility with Kubernetes clusters running only over IPv4 or IPv6. Implementation details are in the KEP.
An example of both IP address types (IPv4 and IPv6) being reported for a pod:
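A minimal sketch of the corresponding excerpt from a pod's status on a dual-stack cluster (the addresses are illustrative):
status:
  podIP: 10.244.1.4                  # primary address, kept for compatibility
  podIPs:
  - ip: 10.244.1.4
  - ip: fd00:10:244:1::4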
- A new API for endpoints: the EndpointSlice API. It addresses the performance/scalability problems of the existing Endpoint API that affect various control-plane components (apiserver, etcd, endpoints-controller, kube-proxy). The new API will be added to the Discovery API group and will be able to serve tens of thousands of backend endpoints per service in a cluster of a thousand nodes. To do this, each Service is mapped to N EndpointSlice objects, each of which holds no more than 100 endpoints by default (the value is configurable). The EndpointSlice API also opens up room for its future development: support for multiple IP addresses per pod, new endpoint states (not only Ready and NotReady), and dynamic subsetting of endpoints.
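As a sketch of the general shape of such an object (the apiVersion, the addressType value, and the names/addresses below are assumptions on our part; exact field details in the 1.16 alpha API may differ slightly):
apiVersion: discovery.k8s.io/v1alpha1    # alpha API group in 1.16 (assumed)
kind: EndpointSlice
metadata:
  name: example-svc-abc12                # hypothetical, generated per Service
  labels:
    kubernetes.io/service-name: example-svc
addressType: IP                          # assumed value for the alpha API
ports:
- name: http
  protocol: TCP
  port: 8080
endpoints:
- addresses:
  - "10.244.1.5"
  conditions:
    ready: true
  topology:
    kubernetes.io/hostname: node-1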
The finalizer introduced in the previous release, named service.kubernetes.io/load-balancer-cleanup and attached to every service of type LoadBalancer, has been promoted to beta. When such a service is deleted, it prevents the actual removal of the resource until the "cleanup" of all related load-balancer resources has completed.
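For illustration, this is roughly how the finalizer appears in the metadata of such a Service (the name and spec below are hypothetical):
apiVersion: v1
kind: Service
metadata:
  name: public-lb                        # hypothetical name
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
spec:
  type: LoadBalancer
  selector:
    app: web                             # hypothetical selector
  ports:
  - port: 80
    targetPort: 8080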
API Machinery
The real "stabilization milestone" is fixed in the area of the Kubernetes API server and interaction with it. In many respects, this happened due to the
transfer to the stable status of CustomResourceDefinitions (CRD) that
did not need a special presentation , which had beta status since the distant Kubernetes 1.7 (and this is June 2017!). The same stabilization came to the features related to them:
- "Subresources" with
/status
and /scale
for CustomResources; - version conversion for CRD, based on an external webhook;
- recently introduced (in K8s 1.15) default values (defaulting) and automatic field deletion (pruning) for CustomResources;
- the possibility of using the OpenAPI v3 scheme for creating and publishing OpenAPI documentation used to validate CRD resources on the server side.
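A minimal sketch combining these features in the now-stable apiextensions.k8s.io/v1 API (the group, kind, and fields of the hypothetical Widget resource are illustrative):
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com              # hypothetical CRD
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: widgets
    singular: widget
    kind: Widget
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:                   # structural schema, used for server-side validation and pruning
        type: object
        properties:
          spec:
            type: object
            properties:
              replicas:
                type: integer
                default: 1               # defaulting, introduced in K8s 1.15
    subresources:
      status: {}                         # the /status subresource
      scale:                             # the /scale subresource
        specReplicasPath: .spec.replicas
        statusReplicasPath: .status.replicas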
Another mechanism long familiar to Kubernetes administrators, the admission webhook, has also spent a long time in beta status (since K8s 1.9) and is now declared stable.
Two other features have reached beta: server-side apply and watch bookmarks.
And the only significant innovation in alpha is the abandonment of SelfLink, a special URI that represents a given object and is part of ObjectMeta and ListMeta (i.e., part of every object in Kubernetes). Why abandon it? The "simple" motivation is that there are no real (compelling) reasons left for this field to exist. The more formal reasons are to optimize performance (by removing an unneeded field) and to simplify the work of the generic-apiserver, which is forced to handle this field in a special way (it is the only field set right before an object is serialized). The actual deprecation (in beta) of SelfLink will take place by Kubernetes 1.20, and the final removal by 1.21.
Data storage
As in previous releases, most of the work in the storage area centers on CSI support. The main changes here are:
- support for CSI plugins on Windows worker nodes has appeared for the first time (in alpha): the current approach to working with storage will replace both the in-tree plugins in the Kubernetes core and the PowerShell-based FlexVolume plugins from Microsoft;

Kubernetes Windows CSI Plugin Implementation Scheme
- the ability to resize CSI volumes, introduced back in K8s 1.12, has been promoted to beta;
- the ability to use CSI to create local ephemeral volumes (CSI Inline Volume Support) has received a similar promotion (from alpha to beta); see the sketch below.
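A minimal sketch of an inline (ephemeral) CSI volume declared directly in a pod spec; the driver name and its attributes are hypothetical:
apiVersion: v1
kind: Pod
metadata:
  name: scratch-pod                      # hypothetical name
spec:
  containers:
  - name: app
    image: busybox:1.31                  # illustrative image
    command: ["sleep", "3600"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    csi:                                 # inline CSI volume (beta in 1.16)
      driver: inline.csi.example.com     # hypothetical CSI driver supporting ephemeral volumes
      volumeAttributes:
        size: "1Gi"                      # driver-specific attribute (hypothetical)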
The volume cloning feature that appeared in the previous Kubernetes version (using an existing PVC as a DataSource to create a new PVC) has now also received beta status.
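A minimal sketch, assuming a CSI-backed storage class that supports cloning (all names are hypothetical):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cloned-pvc                       # hypothetical name
spec:
  storageClassName: csi-fast             # hypothetical CSI-backed StorageClass
  dataSource:
    kind: PersistentVolumeClaim
    name: source-pvc                     # existing PVC to clone (hypothetical)
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi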
Scheduler
Two notable scheduling changes (both in alpha):
- EvenPodsSpreading: the ability to spread load "fairly" across individual pods rather than across logical application units (such as a Deployment or ReplicaSet), and to tune that spreading (as a hard requirement or as a soft condition, i.e. a priority). The feature extends the existing distribution options for scheduled pods, currently limited to PodAffinity and PodAntiAffinity, giving administrators finer control, which means better availability and more optimal resource consumption. Details are in the KEP; a sketch of the new pod-level field follows the figure below.
- Use of the BestFit policy in the RequestedToCapacityRatio priority function during pod scheduling, which allows bin packing to be applied both to core resources (CPU, memory) and to extended ones (such as GPU). See the KEP for details.

Pod scheduling: before using the best fit policy (directly through default scheduler) and using it (via scheduler extender)
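A minimal sketch of the new pod-level field (alpha in 1.16, behind the EvenPodsSpread feature gate); the labels and the zone topology key are assumptions about the cluster:
apiVersion: v1
kind: Pod
metadata:
  name: web-1                            # hypothetical name
  labels:
    app: web
spec:
  topologySpreadConstraints:
  - maxSkew: 1                           # allowed imbalance between topology domains
    topologyKey: failure-domain.beta.kubernetes.io/zone   # assumes nodes carry a zone label
    whenUnsatisfiable: DoNotSchedule     # hard requirement; ScheduleAnyway makes it a soft preference
    labelSelector:
      matchLabels:
        app: web
  containers:
  - name: web
    image: nginx:1.17                    # illustrative image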
In addition, it has become possible to create your own scheduler plugins outside the main Kubernetes development tree (out-of-tree).
Other changes
Also notable in the Kubernetes 1.16 release is an initiative to bring the existing metrics into full order, or, more precisely, into line with the official K8s instrumentation requirements, which largely build on the corresponding Prometheus documentation. The inconsistencies arose for various reasons (for example, some metrics were simply created before the current guidelines appeared), and the developers decided it was time to bring everything to a single standard, "in line with the rest of the Prometheus ecosystem." The current implementation of this initiative has alpha status, which will gradually be raised in subsequent Kubernetes versions to beta (1.17) and stable (1.18).
In addition, the following changes can be noted:
- Further development of Windows support: the kubeadm utility arrives for this OS (alpha), RunAsUserName becomes available for Windows containers (alpha), Group Managed Service Account (gMSA) support improves to beta, and mount/attach support is added for vSphere volumes.
- A reworked data-compression mechanism for API responses. Previously an HTTP filter was used for this, which imposed a number of restrictions that prevented enabling it by default. Transparent request compression now works: clients that send Accept-Encoding: gzip in the header receive a gzip-compressed response if its size exceeds 128 KB. Go clients support compression automatically (they send the required header), so they will notice a reduction in traffic right away. (Minor modifications may be required in other languages.)
- HPA can now scale from/to zero pods based on external metrics. If scaling is driven by objects/external metrics, then when workloads are idle you can automatically scale down to 0 replicas to save resources. This feature should be especially useful when workers request GPU resources and the number of idle workers of various kinds exceeds the number of available GPUs. A minimal sketch is shown after this list.
- A new client, k8s.io/client-go/metadata.Client, for "generalized" access to objects. It is designed to make it easy to retrieve metadata (i.e., the metadata subsection) of cluster resources and to perform garbage-collection and quota operations on them.
- Kubernetes can now be built without the legacy ("built-in", in-tree) cloud providers (alpha).
- The experimental (alpha) ability to apply kustomize patches during init, join, and upgrade operations has been added to the kubeadm utility. For details on using the --experimental-kustomize flag, see the KEP.
- A new readyz endpoint for the apiserver lets it export information about its readiness. The API server also gains a --maximum-startup-sequence-duration flag, which makes it possible to regulate its restarts.
- Two features for Azure are declared stable: support for Availability Zones and for cross resource groups (RG). In addition, Azure support received a number of further additions.
- AWS gains support for EBS on Windows and optimized DescribeInstances EC2 API calls.
- Kubeadm now migrates its CoreDNS configuration on its own when the CoreDNS version is upgraded.
- The etcd binaries in the corresponding Docker image have been made world-executable, which allows this image to be run without root privileges. Also, the etcd migration image has dropped support for etcd2.
- Cluster Autoscaler 1.16.0 switched to using distroless as a base image, improved performance, and added new cloud providers (DigitalOcean, Magnum, Packet).
- Updates in the used / dependent software: Go 1.12.9, etcd 3.3.15, CoreDNS 1.6.2.
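As announced in the HPA item above, here is a minimal sketch of an autoscaler allowed to drop to zero replicas; it assumes the corresponding alpha feature gate (HPAScaleToZero) is enabled and that a hypothetical external metric called queue_depth is available:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-worker-hpa                   # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-worker                     # hypothetical workload
  minReplicas: 0                         # allowed only with the scale-to-zero feature gate (alpha)
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: queue_depth                # hypothetical external metric
      target:
        type: AverageValue
        averageValue: "5"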