GEP: Standardized Telemetry API
- Issue: #4768
- Status: Provisional
TLDR
This proposal introduces a standardized, provider-agnostic Telemetry API to configure observability signals (metrics, access logs, and traces) for North/South (Gateway) traffic, addressing the fragmentation caused by vendor-specific CRDs.
Goals
- Establish a standardized model to configure provider-agnostic telemetry (metrics, access logs, and traces) for Gateways.
Non-Goals
- Defining how the telemetry is exported (sinks/shippers) beyond specifying the provider endpoint and relevant connectivity parameters.
- Replacing the underlying telemetry infrastructure (OTLP collectors, Prometheus, etc.).
- Standardizing metrics; this proposal exclusively focuses on the telemetry configuration API.
Introduction / Overview
This GEP proposes the addition of a standardized, provider-agnostic Telemetry API to the Gateway API project. The proposal aims to define a unified configuration model for the generation and propagation of telemetry signals (i.e., metrics, access logs, distributed traces) for North/South (Gateway) traffic.
The API focuses on providing a consistent way to express observability intent, such as sampling rates for tracing, metric customization, and log filtering, regardless of the underlying data plane implementation.
Purpose (Why and Who)
The Fragmentation of Observability
In the current Kubernetes landscape, the “Who, What, Where, and How Long” of network traffic is answered differently depending on the underlying proxy technology. While the Gateway API specification has unified how traffic is routed via HTTPRoute and Gateway, it has deferred the standardization of how that traffic is observed. This deferral has led to “Observability Lock-in”: Platform Engineering teams are forced to learn and manage distinct APIs for each environment. A standardized telemetry API is necessary to decouple the intent of observability from the implementation. Without such standardization, it is difficult for platform owners to:
- Enforce consistent auditing and observability standards across different infrastructure providers.
- Support emerging workloads like AI Agents, which elevate the criticality of observability due to their autonomous, non-deterministic nature and requirements for specialized signals.
Who
- Platform Operators: Need to ensure uniform observability across all networking infrastructure.
- Observability Teams: Responsible for the governance of telemetry data. They need to define and enforce standardized schemas and collection policies across the entire organization.
- Security/Auditing Teams: Require a standardized audit trail for all traffic, an increasingly important need with the emergence of autonomous agent actions.
- Application Developers: Benefit from consistent metrics and traces for debugging without worrying about the underlying gateway technology.
API
Policy Attachment vs. Inline Configuration
A key area of discussion for this GEP is whether this should be a standalone Policy Attachment (e.g., TelemetryPolicy) or inline configuration within Gateway or HTTPRoute resources.
This proposal argues that the Policy Attachment model is the most effective approach to meet the stated goals, primarily for two reasons:
- Separation of Concerns: It allows different personas to manage Gateway infrastructure independently from the configuration of telemetry signals. Telemetry is typically configured by platform, observability, or security engineers rather than application developers. This also implies that HTTPRoute is not the ideal resource to target for the initial API implementation.
- Uniformity: It enables a single policy to be applied uniformly across a set of Gateways, eliminating the need to duplicate complex telemetry configurations across individual resources.
To mitigate the challenge of complex merging semantics, this GEP restricts configuration so that only a single TelemetryPolicy can be in effect for a given Gateway at any time. If multiple TelemetryPolicy resources target the same Gateway, precedence is determined by creation timestamp. This lets the API start with a simple configuration model and iterate, based on feedback, on whether multiple TelemetryPolicy resources per target are needed.
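The precedence rule above could be sketched roughly as follows. The text says only that precedence is based on the creation timestamp, so this sketch assumes that the oldest policy wins and that ties are broken by namespace and name. Both are assumptions, not something this GEP specifies. The names `policyRef` and `selectEffective` are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// policyRef is a hypothetical, trimmed-down view of a TelemetryPolicy:
// just enough metadata to resolve a conflict between policies.
type policyRef struct {
	Namespace string
	Name      string
	Created   time.Time
}

// selectEffective returns the policy that takes effect for one Gateway and
// the policies that lose the conflict.
// ASSUMPTION: the oldest creation timestamp wins, and ties are broken by
// namespace and then name in alphabetical order. The GEP itself only says
// that precedence is based on the creation timestamp.
func selectEffective(candidates []policyRef) (*policyRef, []policyRef) {
	if len(candidates) == 0 {
		return nil, nil
	}
	sorted := append([]policyRef(nil), candidates...) // leave the input untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if !a.Created.Equal(b.Created) {
			return a.Created.Before(b.Created)
		}
		if a.Namespace != b.Namespace {
			return a.Namespace < b.Namespace
		}
		return a.Name < b.Name
	})
	return &sorted[0], sorted[1:]
}

func main() {
	t0 := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	winner, rejected := selectEffective([]policyRef{
		{Namespace: "prod-ns", Name: "newer-policy", Created: t0.Add(time.Hour)},
		{Namespace: "prod-ns", Name: "standard-telemetry", Created: t0},
	})
	fmt.Println("effective:", winner.Name, "rejected:", len(rejected))
}
```

An implementation following this approach would presumably also report the losing policies through their status, so that operators can see why a policy is not in effect.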
High-level Considerations:
- Tracing: Configuration for OTLP endpoints, sampling rates (probabilistic and parent-based), and custom resource/span attributes.
- Metrics: Ability to enable/disable specific metric families and customize dimensions (labels/attributes).
- Access Logs: Filtering for smart logging (e.g., only log 5xx errors or high latency), multi-protocol support, and log format customization (including field selection).
- Export Configuration: Supporting TLS connections to telemetry collectors and the ability to inject custom headers (e.g., Authorization) into telemetry requests.
Request Flow
- A platform operator creates a TelemetryPolicy resource targeting a Gateway.
- The Gateway API implementation reconciles this resource and configures the underlying data plane.
- The data plane extracts the specified signals and exports them to the telemetry infrastructure.
The TelemetryPolicy Specification
We propose the TelemetryPolicy as a direct policy attachment in the gateway.networking.k8s.io API group. See GEP-713 for more information on direct attachment.
The following is an example that demonstrates the structure of the TelemetryPolicy.
apiVersion: agentic.networking.x-k8s.io/v1alpha1
kind: TelemetryPolicy
metadata:
  name: standard-telemetry
  namespace: prod-ns
spec:
  # GEP-713 Attachment
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: my-gateway
  # 1. Tracing Configuration
  tracing:
    mode: "On"
    provider:
      endpoint: "otel-collector.monitoring.svc:4317"
    samplingRate:
      numerator: 5 # Represents 5/100 (5%) because denominator defaults to 100
    parentBasedSampling:
      mode: "On"
      samplingRate:
        numerator: 50 # Represents 50/100 (50%)
    customAttributes:
    - name: "env"
      type: Literal
      literalValue: "production"
    - name: "mcp_task_name"
      type: Metadata
      metadataKey: "my.custom.filter.mcp_task_name"
  # 2. Metrics Configuration
  metrics:
    mode: "On"
    overrides:
    - name: "example.com/http/request_count"
      type: Counter
      attributes: # Inject custom attributes/labels
      - name: "x-model-id"
        type: Header
        headerName: "X-Model-Id"
      - name: "mcp_task_name"
        type: Metadata
        metadataKey: "my.custom.filter.mcp_task_name"
      - name: "environment"
        type: Literal
        literalValue: "production"
  # 3. Access Logs Configuration
  accessLogs:
    mode: "Off" # Explicitly disabled while keeping the configuration intact
    matches: "response.code >= 500" # Conditional logging, CEL filtering for errors
    fields: # Configure specific fields to include, indicating their source
    - name: "start_time"
      type: Standard
      standardValue: "RequestStartTime"
    - name: "response_code"
      type: Standard
      standardValue: "ResponseCode"
    - name: "x-token-usage"
      type: Header
      headerName: "X-Token-Usage"
    - name: "mcp_task_name"
      type: Metadata
      metadataKey: "my.custom.filter.mcp_task_name"
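The samplingRate comments in the example express rates as a numerator over a denominator that defaults to 100. A minimal sketch of how an implementation might interpret such a value is shown below. The `fraction` type mirrors the proposed Fraction struct. Treating a missing or non-positive denominator as 100 and deciding per trace with a modulo bucket are assumptions made for illustration; the GEP does not prescribe a sampling algorithm. Parent-based sampling is deliberately left out.

```go
package main

import "fmt"

// fraction mirrors the proposed Fraction type (numerator / denominator).
type fraction struct {
	Numerator   int32
	Denominator int32
}

// defaultDenominator matches the documented default of 100.
const defaultDenominator = 100

// denominator returns the effective denominator.
// ASSUMPTION: zero or negative values fall back to the default, as if API
// server defaulting had not been applied yet.
func (f fraction) denominator() int32 {
	if f.Denominator <= 0 {
		return defaultDenominator
	}
	return f.Denominator
}

// probability converts the fraction into a value clamped to [0, 1].
func (f fraction) probability() float64 {
	p := float64(f.Numerator) / float64(f.denominator())
	if p < 0 {
		return 0
	}
	if p > 1 {
		return 1
	}
	return p
}

// shouldSample makes a deterministic decision for one trace.
// ASSUMPTION: traceKey is a uniformly distributed number derived from the
// trace ID; the key is bucketed by the denominator and kept when the bucket
// index is below the numerator.
func shouldSample(f fraction, traceKey uint64) bool {
	den := f.denominator()
	if f.Numerator <= 0 {
		return false
	}
	if f.Numerator >= den {
		return true
	}
	return traceKey%uint64(den) < uint64(f.Numerator)
}

func main() {
	rate := fraction{Numerator: 5} // denominator omitted, so 100 is assumed
	fmt.Printf("probability=%.4f\n", rate.probability())

	sampled := 0
	for key := uint64(0); key < 10000; key++ {
		if shouldSample(rate, key) {
			sampled++
		}
	}
	fmt.Println("sampled", sampled, "of 10000 keys")
}
```

Using an integer fraction rather than a float keeps very small rates, such as 1/10000, exact in the API object and leaves the conversion to the data plane.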
Detailed Resource Description
The following are the Go structs modeling the proposed specification.
// TelemetryPolicy defines a direct policy attachment to configure
// observability signals for Gateways.
type TelemetryPolicy struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec TelemetryPolicySpec `json:"spec"`
	// status defines the observed state of TelemetryPolicy.
	// +optional
	Status TelemetryPolicyStatus `json:"status,omitempty"`
}

type TelemetryPolicySpec struct {
	// Identifies the target gateways to which this policy attaches (GEP-713).
	TargetRefs []NamespacedPolicyTargetReference `json:"targetRefs"`
	// Configuration for distributed tracing options.
	Tracing *TracingConfig `json:"tracing,omitempty"`
	// Configuration for metric generation and exports.
	Metrics *MetricsConfig `json:"metrics,omitempty"`
	// Configuration for access log generation.
	AccessLogs *AccessLogsConfig `json:"accessLogs,omitempty"`
}
// TelemetryMode defines the enablement state of a telemetry signal.
type TelemetryMode string

const (
	// TelemetryModeOn explicitly enables the telemetry signal.
	TelemetryModeOn TelemetryMode = "On"
	// TelemetryModeOff explicitly disables the telemetry signal.
	TelemetryModeOff TelemetryMode = "Off"
)
// --- Tracing Types ---
type TracingConfig struct {
	// Mode explicitly controls if tracing is enabled. Valid values are "On" or "Off".
	// +kubebuilder:validation:Enum=On;Off
	// +kubebuilder:default=On
	Mode TelemetryMode `json:"mode,omitempty"`
	// Specifies the tracing backend. Currently carries only the endpoint
	// (e.g., the address of an OTLP collector).
	Provider *TracingProvider `json:"provider,omitempty"`
	// The base sampling probability for traces.
	SamplingRate *Fraction `json:"samplingRate,omitempty"`
	// Configures whether to respect the sampling decision of the parent span.
	ParentBasedSampling *ParentBasedSampling `json:"parentBasedSampling,omitempty"`
	// Allows appending custom tags/attributes to spans.
	CustomAttributes []CustomAttribute `json:"customAttributes,omitempty"`
}

type TracingProvider struct {
	Endpoint string `json:"endpoint,omitempty"`
}

type Fraction struct {
	Numerator int32 `json:"numerator"`
	// +kubebuilder:default=100
	// +kubebuilder:validation:Minimum=1
	Denominator int32 `json:"denominator,omitempty"` // Allows e.g., 1 / 10000 for 0.01%
}

type ParentBasedSampling struct {
	// Mode explicitly controls if parent-based sampling is enabled. Valid values are "On" or "Off".
	// +kubebuilder:validation:Enum=On;Off
	// +kubebuilder:default=On
	Mode TelemetryMode `json:"mode,omitempty"`
	// The sampling rate to apply when the parent span decision is used.
	SamplingRate *Fraction `json:"samplingRate,omitempty"`
}
// CustomAttributeType defines the source of a trace attribute's value.
type CustomAttributeType string

const (
	// CustomAttributeTypeHeader extracts the value from an HTTP header.
	CustomAttributeTypeHeader CustomAttributeType = "Header"
	// CustomAttributeTypeMetadata extracts the value from proxy metadata or context.
	CustomAttributeTypeMetadata CustomAttributeType = "Metadata"
	// CustomAttributeTypeLiteral provides a static, user-defined string value.
	CustomAttributeTypeLiteral CustomAttributeType = "Literal"
)

type CustomAttribute struct {
	// Name is the key of the attribute as it will appear in the trace span.
	Name string `json:"name"`
	// Type specifies where the attribute value comes from.
	// Valid values are "Header", "Metadata", or "Literal".
	// +kubebuilder:validation:Enum=Header;Metadata;Literal
	Type CustomAttributeType `json:"type"`
	// HeaderName specifies the HTTP header to extract the value from.
	// This is required if Type is "Header".
	HeaderName *string `json:"headerName,omitempty"`
	// MetadataKey specifies the proxy/context metadata key to extract the value from.
	// This is required if Type is "Metadata".
	MetadataKey *string `json:"metadataKey,omitempty"`
	// LiteralValue specifies a static string value to attach.
	// This is required if Type is "Literal".
	LiteralValue *string `json:"literalValue,omitempty"`
}
// --- Metrics Types ---
type MetricsConfig struct {
	// Mode explicitly controls if metric generation is enabled. Valid values are "On" or "Off".
	// +kubebuilder:validation:Enum=On;Off
	// +kubebuilder:default=On
	Mode TelemetryMode `json:"mode,omitempty"`
	// List of configurations to customize specific metric families.
	Overrides []MetricOverride `json:"overrides,omitempty"`
}

type MetricOverride struct {
	// The metric name to override (e.g., "http_requests_total" or "gateway.networking.k8s.io/http/request_count").
	Name string `json:"name"`
	// Type of the metric (e.g., "Counter", "Histogram").
	Type string `json:"type,omitempty"`
	// Defines custom attributes to attach to the metric.
	// These are appended to the standard labels emitted by the proxy.
	Attributes []MetricAttribute `json:"attributes,omitempty"`
}

// MetricAttributeType defines the source of a metric attribute's value.
type MetricAttributeType string

const (
	// MetricAttributeTypeHeader extracts the value from an HTTP header.
	MetricAttributeTypeHeader MetricAttributeType = "Header"
	// MetricAttributeTypeMetadata extracts the value from proxy metadata or context.
	MetricAttributeTypeMetadata MetricAttributeType = "Metadata"
	// MetricAttributeTypeLiteral provides a static, user-defined string value.
	MetricAttributeTypeLiteral MetricAttributeType = "Literal"
)

type MetricAttribute struct {
	// Name is the key of the attribute as it will appear in the metric.
	Name string `json:"name"`
	// Type specifies where the attribute value comes from.
	// Valid values are "Header", "Metadata", or "Literal".
	// +kubebuilder:validation:Enum=Header;Metadata;Literal
	Type MetricAttributeType `json:"type"`
	// HeaderName specifies the HTTP header to extract the value from.
	// This is required if Type is "Header".
	HeaderName *string `json:"headerName,omitempty"`
	// MetadataKey specifies the proxy/context metadata key to extract the value from.
	// This is required if Type is "Metadata".
	MetadataKey *string `json:"metadataKey,omitempty"`
	// LiteralValue specifies a static string value to attach.
	// This is required if Type is "Literal".
	LiteralValue *string `json:"literalValue,omitempty"`
}
// --- Access Logs Types ---
type AccessLogsConfig struct {
	// Mode explicitly controls if access logging is enabled. Valid values are "On" or "Off".
	// +kubebuilder:validation:Enum=On;Off
	// +kubebuilder:default=On
	Mode TelemetryMode `json:"mode,omitempty"`
	// CEL expression for advanced filtering (e.g., matching response codes, headers).
	Matches string `json:"matches,omitempty"`
	// A list of specific fields to include in the logs, specifying their source.
	Fields []LogField `json:"fields,omitempty"`
}
// LogFieldType defines the source of a log field's value.
type LogFieldType string

const (
	// LogFieldTypeHeader extracts the value from an HTTP header.
	LogFieldTypeHeader LogFieldType = "Header"
	// LogFieldTypeMetadata extracts the value from proxy metadata or context.
	LogFieldTypeMetadata LogFieldType = "Metadata"
	// LogFieldTypeLiteral provides a static, user-defined string value.
	LogFieldTypeLiteral LogFieldType = "Literal"
	// LogFieldTypeStandard extracts a standard proxy log value (e.g., duration, start time).
	LogFieldTypeStandard LogFieldType = "Standard"
)

type LogField struct {
	// Name is the key/name of the field as it will appear in the access log output.
	Name string `json:"name"`
	// Type specifies where the field value comes from.
	// Valid values are "Header", "Metadata", "Literal", or "Standard".
	// +kubebuilder:validation:Enum=Header;Metadata;Literal;Standard
	Type LogFieldType `json:"type"`
	// HeaderName specifies the HTTP header to extract the value from.
	// This is required if Type is "Header".
	HeaderName *string `json:"headerName,omitempty"`
	// MetadataKey specifies the proxy/context metadata key to extract the value from.
	// This is required if Type is "Metadata".
	MetadataKey *string `json:"metadataKey,omitempty"`
	// LiteralValue specifies a static string value to attach.
	// This is required if Type is "Literal".
	LiteralValue *string `json:"literalValue,omitempty"`
	// StandardValue specifies a standard log property (e.g., "RequestStartTime", "Duration").
	// This is required if Type is "Standard".
	StandardValue *string `json:"standardValue,omitempty"`
}
// --- Policy Status ---
// TelemetryPolicyStatus defines the observed state of TelemetryPolicy.
type TelemetryPolicyStatus struct {
	// For Policy Status API conventions, see:
	// https://gateway-api.sigs.k8s.io/geps/gep-713/#the-status-stanza-of-policy-objects
	//
	// Ancestors is a list of ancestor resources (usually Gateways) that are
	// associated with the policy, and the status of the policy with respect to
	// each ancestor. When this policy attaches to a parent, the controller that
	// manages the parent and the ancestors MUST add an entry to this list when
	// the controller first sees the policy and SHOULD update the entry as
	// appropriate when the relevant ancestor is modified.
	//
	// Note that choosing the relevant ancestor is left to the Policy designers;
	// an important part of Policy design is designing the right object level at
	// which to namespace this status.
	//
	// Note also that implementations MUST ONLY populate ancestor status for
	// the Ancestor resources they are responsible for. Implementations MUST
	// use the ControllerName field to uniquely identify the entries in this list
	// that they are responsible for.
	//
	// Note that to achieve this, the list of PolicyAncestorStatus structs
	// MUST be treated as a map with a composite key, made up of the AncestorRef
	// and ControllerName fields combined.
	//
	// A maximum of 16 ancestors will be represented in this list. An empty list
	// means the Policy is not relevant for any ancestors.
	//
	// If this slice is full, implementations MUST NOT add further entries.
	// Instead they MUST consider the policy unimplementable and signal that
	// on any related resources such as the ancestor that would be referenced
	// here.
	//
	// +required
	// +listType=atomic
	// +kubebuilder:validation:MaxItems=16
	Ancestors []PolicyAncestorStatus `json:"ancestors"`
}
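CustomAttribute, MetricAttribute, and LogField all follow the same pattern: the Type field selects which value field is required. The sketch below shows one way an implementation or admission webhook might check that rule. The `valueSource` type and the `validate` and `ptr` helpers are hypothetical names introduced here. Rejecting value fields that do not match the selected type is an additional assumption, since the struct comments only state which field is required.

```go
package main

import "fmt"

// valueSource is a hypothetical stand-in for CustomAttribute,
// MetricAttribute, and LogField, which share the same shape.
type valueSource struct {
	Name          string
	Type          string
	HeaderName    *string
	MetadataKey   *string
	LiteralValue  *string
	StandardValue *string // only meaningful for LogField
}

// ptr returns a pointer to s, which is convenient for optional fields.
func ptr(s string) *string { return &s }

// validate checks that the field selected by Type is set and non-empty.
// ASSUMPTION: fields belonging to other types must be unset. The GEP only
// states which field is required, not that the others are forbidden.
// A real validator would also reject "Standard" for trace and metric
// attributes, whose enums do not include it.
func validate(v valueSource) error {
	fields := map[string]*string{
		"Header":   v.HeaderName,
		"Metadata": v.MetadataKey,
		"Literal":  v.LiteralValue,
		"Standard": v.StandardValue,
	}
	selected, known := fields[v.Type]
	if !known {
		return fmt.Errorf("%s: unknown type %q", v.Name, v.Type)
	}
	if selected == nil || *selected == "" {
		return fmt.Errorf("%s: type %q requires its matching value field", v.Name, v.Type)
	}
	for typ, val := range fields {
		if typ != v.Type && val != nil {
			return fmt.Errorf("%s: field for type %q set but type is %q", v.Name, typ, v.Type)
		}
	}
	return nil
}

func main() {
	good := valueSource{Name: "env", Type: "Literal", LiteralValue: ptr("production")}
	bad := valueSource{Name: "x-model-id", Type: "Header"} // headerName is missing

	fmt.Println("good:", validate(good))
	fmt.Println("bad: ", validate(bad))
}
```

In a CRD, the same constraint could likely be expressed declaratively with validation rules on the type, which would avoid the need for a webhook.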
Comparison with Prior Art
Istio
Istio’s Telemetry API is the most direct prior art that inspired this proposal. It allows configuring observability at the mesh, namespace, and workload level.
- Metrics: Istio allows users to enable/disable specific metrics, add custom dimensions, and configure providers.
- Logs: Istio supports access logging configurations with CEL-like expressions for advanced filtering.
- Traces: Istio supports probabilistic sampling, context propagation, and custom span tags.
- Customization: For advanced telemetry use-cases not natively covered by the Telemetry API, Istio users can fall back to using EnvoyFilter resources. While highly flexible, EnvoyFilter requires deep knowledge of Envoy’s internal xDS API. This is tightly coupled to the data plane implementation and can be brittle across version upgrades.
- Comparison: The proposed TelemetryPolicy adapts Istio’s powerful intent-based capabilities to the standardized Gateway API attachment model.
Envoy Gateway
Envoy Gateway configures observability through two distinct custom resources: EnvoyGateway for the control plane and EnvoyProxy for the underlying data plane proxies.
- Metrics: Envoy Gateway allows configuring Prometheus and OpenTelemetry sinks for both the control plane (using the EnvoyGateway CRD) and the data plane proxies (using the EnvoyProxy CRD).
- Logs: Proxy access logs are configured via the EnvoyProxy resource. It supports exporting to file, OTLP, or gRPC Access Log Service (ALS) sinks. It uses CEL expressions for smart filtering (e.g., matching specific headers), and allows applying log configurations at the Route or Listener level.
- Tracing: Tracing is configured in the EnvoyProxy resource. It allows configuring sampling and supports appending custom tags derived from literals, environment variables, or request headers.
- Customization: For advanced telemetry use-cases not covered natively, users can fall back to the EnvoyPatchPolicy API to mutate the underlying xDS configuration using JSON Patch semantics. This is similar to Istio’s EnvoyFilter.
- Comparison: While Envoy Gateway provides a robust, native telemetry configuration, it is tightly coupled to infrastructure-oriented CRDs. The proposed TelemetryPolicy allows users to configure telemetry behaviors using a portable targetRef model, without binding their observability intent to an Envoy-specific schema.
Kuadrant
Kuadrant provides observability for API management features like rate limiting and authentication. It is configured through a mix of its own custom resources and the underlying gateway’s APIs.
- Metrics: Kuadrant enables metrics via the Kuadrant CR. It also introduces its own TelemetryPolicy API (extensions.kuadrant.io/v1alpha1) to add custom dimensions to metrics.
- Logs: For proxy access logging, Kuadrant relies on the underlying gateway provider (e.g., Istio’s Telemetry API). However, it configures request correlation across its own components (Authorino, Limitador, and Wasm-shim) by specifying HTTP header identifiers in the Kuadrant CR.
- Tracing: Tracing is configured centrally via the Kuadrant CR. It exports OpenTelemetry spans for both the control plane and data plane components. It supports global trace filtering levels to control the verbosity of exported spans.
- Customization: To make low-level, custom modifications to the data plane configuration that are not supported by Kuadrant’s native APIs, users can bypass Kuadrant and directly use the underlying gateway’s mechanisms.
- Comparison: While Kuadrant provides powerful, identity-aware telemetry (like token tracking per user), its configuration is fragmented across the Kuadrant CR, component-specific CRDs, its custom extension TelemetryPolicy, and the underlying gateway’s native APIs. The proposed TelemetryPolicy aims to unify these intent-based capabilities into a single, provider-agnostic resource.