Shifting left with telemetry pipelines: The future of data tiering at petabyte scale

news_fiverr

November 5, 2024

In observability and security, the concept of “shifting left” has moved beyond software development. As volumes of logs, metrics, traces, and events keep rising, organizations have to be more deliberate about how they understand and manage their streaming data. Teams want to manage mission-critical production systems proactively and detect potential issues far earlier. Shifting left moves traditionally late-stage activities (seeing, understanding, transforming, filtering, analyzing, testing, and monitoring) closer to the point where data is created. With the growth of cloud-native technologies, microservices, and Kubernetes, enterprises are increasingly adopting Telemetry Pipelines to enable this shift. A key element of the movement is data tiering, a data-optimization strategy that helps observability and security teams align what they spend on data with the value they get from it.

The Shift Left Movement: Chaos to Control

“Shifting left” originated in the realm of DevOps and software testing. The idea was simple: find and fix problems earlier in the process to reduce risk, improve quality, and accelerate development. As organizations have embraced DevOps and continuous integration/continuous delivery (CI/CD) pipelines, the benefits of shifting left have become increasingly clear — less rework, faster deployments, and more robust systems.

In the context of observability and security, shifting left means analyzing, transforming, and routing logs, metrics, traces, and events far upstream, early in their lifecycle, rather than following the traditional “centralize then analyze” method. By moving these processes earlier, teams can sharply reduce costs for otherwise prohibitive data volumes and can detect anomalies, performance issues, and potential security threats sooner, before they become major problems in production. The rise of microservices and Kubernetes architectures has accelerated this need: cloud-native applications are complex and distributed, so they demand more granular, real-time insight, and their telemetry is spread across many small, localized data sets rather than collected in one place as it was with the monoliths of the past.

This leads to the growing adoption of Telemetry Pipelines.

What Are Telemetry Pipelines?

Telemetry Pipelines are purpose-built to enable next-generation architectures. They are designed to give visibility and to pre-process, analyze, transform, and route observability and security data from any source to any destination. These pipelines give organizations the comprehensive toolbox and set of capabilities to control and optimize the flow of telemetry data, ensuring that the right data reaches the right downstream destination in the right format, to enable all the right use cases. They offer a flexible and scalable way to integrate multiple observability and security platforms, tools, and services.

For example, in a Kubernetes environment, ephemeral containers are created and destroyed as workloads scale up and down, so logs, metrics, and traces from those workloads need to be processed and stored in real time. Telemetry Pipelines can aggregate data from various services, apply granular rules for what to do with it, and send it downstream to the appropriate destination, whether that’s a traditional security platform like Splunk that has a high unit cost for data, or a more scalable and cost-effective long-term store for large datasets, like Amazon S3.

The Role of Data Tiering

As telemetry data continues to grow at an exponential rate, enterprises face the challenge of managing costs without compromising on the insights they need in real time, or the requirement of data retention for audit, compliance, or forensic security investigations. This is where data tiering comes in. Data tiering is a strategy that segments data into different levels (tiers) based on its value and use case, enabling organizations to optimize both cost and performance.

In observability and security, this means identifying high-value data that requires immediate analysis and applying a lot more pre-processing and analysis to that data, compared to lower-value data that can simply be stored more cost effectively and accessed later, if necessary. This tiered approach typically includes:

Top Tier (High-Value Data): Critical telemetry data that is vital for real-time analysis and troubleshooting is ingested and stored in high-performance platforms like Splunk or Datadog. This data might include high-priority logs, metrics, and traces that are essential for immediate action. Although these platforms can accept plenty of raw data, their high cost typically leads teams to route only the data that’s truly necessary.
Middle Tier (Moderate-Value Data): Data that is important but doesn’t meet the bar for a premium, conventional centralized system is instead routed to more cost-efficient observability platforms with newer architectures, like Edge Delta. This might include a much more comprehensive set of logs, metrics, and traces that gives a wider view of what is happening within mission-critical systems.
Bottom Tier (All Data): Because object storage like Amazon S3 is extremely inexpensive relative to observability and security platforms, it is feasible to store all telemetry data there for long-term trend analysis, audit or compliance, or investigation purposes. This is typically cold storage that can be accessed on demand but doesn’t need to be actively processed.

This multi-tiered architecture enables large enterprises to get the insights they need from their data while also managing costs and ensuring compliance with data retention policies. Keep in mind that the tiers are nested: the Middle Tier typically includes all data in the Top Tier and more, and the Bottom Tier includes all data from the higher tiers and more. Because the cost of the underlying destinations can differ between tiers by orders of magnitude, there is little to gain from excluding data from S3 just because it is already in Datadog, for instance. It’s much easier and more useful to have the full data set in S3 for any later needs.
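
The way each tier contains the tiers above it can be sketched as a small routing function. This is a minimal illustration in Python, not any vendor's API: the tier names, the `severity` field, and the rule for what counts as high or moderate value are all hypothetical assumptions chosen for the example.

```python
from typing import Dict, List

# Hypothetical tier names, ordered from most to least expensive destination.
TOP, MIDDLE, BOTTOM = "top", "middle", "bottom"

# Hypothetical rules for which severities count as high or moderate value.
HIGH_VALUE_SEVERITIES = {"error", "critical"}
MODERATE_VALUE_SEVERITIES = {"warning", "info"}


def tiers_for(record: Dict[str, str]) -> List[str]:
    """Return every tier a record should be sent to.

    Tiers are nested: anything sent to the top tier also goes to the
    middle and bottom tiers, and every record goes to the bottom tier.
    """
    severity = record.get("severity", "").lower()
    tiers = [BOTTOM]  # the bottom tier keeps the full data set
    if severity in HIGH_VALUE_SEVERITIES or severity in MODERATE_VALUE_SEVERITIES:
        tiers.insert(0, MIDDLE)
    if severity in HIGH_VALUE_SEVERITIES:
        tiers.insert(0, TOP)
    return tiers
```

In a real pipeline the rule would likely depend on more than severity (source, service, compliance tags), but the shape stays the same: the function only ever adds destinations on top of the bottom tier, so the cheapest store always holds the complete data set.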

How Telemetry Pipelines Enable Data Tiering

Telemetry Pipelines serve as the backbone of this tiered data approach by giving full control and flexibility in routing data based on predefined, out-of-the-box rules and/or business logic specific to the needs of your teams. Here’s how they facilitate data tiering:

Real-Time Processing: For high-value data that requires immediate action, Telemetry Pipelines provide real-time processing and routing, ensuring that critical logs, metrics, or security alerts are delivered to the right tool with minimal delay. Because Telemetry Pipelines have an agent component, much of this processing can happen locally, close to the source, with efficient use of compute, memory, and disk.
Filtering and Transformation: Not all telemetry data is created equal, and teams have very different needs for how they may use this data. Telemetry Pipelines enable comprehensive filtering and transformation of any log, metric, trace, or event, ensuring that only the most critical information is sent to high-cost platforms, while the full dataset (including less critical data) can then be routed to more cost-efficient storage.
Data Enrichment and Routing: Telemetry Pipelines can ingest data from a wide variety of sources — Kubernetes clusters, cloud infrastructure, CI/CD pipelines, third-party APIs, etc. — and then apply various enrichments to that data before it’s then routed to the appropriate downstream platform.
Dynamic Scaling: As enterprises scale their Kubernetes clusters and increase their use of cloud services, the volume of telemetry data grows significantly. Because their architecture is aligned with these environments, Telemetry Pipelines scale dynamically with this increasing load without affecting performance or data integrity.
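
The filtering, transformation, and enrichment steps listed above can be pictured as a chain of small stages applied to each record before it reaches a high-cost destination. The following is a minimal sketch in Python under stated assumptions: the stage functions, the field names (`severity`, `user_email`, `service`), and the `SERVICE_OWNERS` lookup table are hypothetical and do not correspond to any particular product's configuration.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional

Record = Dict[str, str]
# A stage returns a (possibly modified) record, or None to drop it.
Stage = Callable[[Record], Optional[Record]]


def drop_debug(record: Record) -> Optional[Record]:
    """Filtering: discard debug-level records."""
    return None if record.get("severity") == "debug" else record


def mask_email(record: Record) -> Optional[Record]:
    """Transformation: redact a sensitive field before it leaves the host."""
    if "user_email" in record:
        record = {**record, "user_email": "[redacted]"}
    return record


# Hypothetical lookup table mapping a service name to its owning team.
SERVICE_OWNERS = {"checkout": "payments-team"}


def add_owner(record: Record) -> Optional[Record]:
    """Enrichment: attach context from a lookup table."""
    owner = SERVICE_OWNERS.get(record.get("service", ""), "unknown")
    return {**record, "owner": owner}


def run_pipeline(records: Iterable[Record], stages: Iterable[Stage]) -> Iterator[Record]:
    """Apply each stage in order; stop processing a record once it is dropped."""
    stage_list = list(stages)
    for record in records:
        current: Optional[Record] = record
        for stage in stage_list:
            current = stage(current)
            if current is None:
                break
        if current is not None:
            yield current
```

A call such as `list(run_pipeline(records, [drop_debug, mask_email, add_owner]))` yields only the records that survive filtering, already redacted and enriched. Under the tiering approach described here, a chain like this would sit in front of the expensive platform, while the unfiltered stream would still be copied to low-cost storage.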

The Benefits for Observability and Security Teams

By adopting Telemetry Pipelines and data tiering, observability and security teams can benefit in several ways:

Cost Efficiency: Enterprises can significantly reduce costs by routing data to the most appropriate tier based on its value, avoiding the unnecessary expense of storing low-value data in high-performance platforms.
Faster Troubleshooting: Some monitoring and anomaly detection can happen within the Telemetry Pipelines themselves, and critical telemetry data is processed quickly and routed to high-performance platforms for real-time analysis, enabling teams to detect and resolve issues much faster.
Enhanced Security: Data enrichment from lookup tables, pre-built packs for known third-party technologies, and more scalable long-term retention of larger datasets all help security teams find indicators of compromise (IOCs) across logs and other telemetry data, improving their ability to detect threats early and respond to incidents faster.
Scalability: As enterprises grow and their telemetry needs expand, Telemetry Pipelines can naturally scale with them, ensuring that they can handle increasing data volumes without sacrificing performance.
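
To make the cost-efficiency argument concrete, the arithmetic can be sketched in a few lines of Python. Every number below is an invented placeholder in arbitrary cost units, not a real vendor price or a measured data volume; the point is only to show how the comparison works when per-GB costs differ widely between tiers.

```python
from typing import Dict

# Placeholder prices in arbitrary cost units per GB. These are invented for
# illustration and are NOT real vendor prices.
PRICE_PER_GB: Dict[str, int] = {"top": 200, "middle": 50, "bottom": 2}


def monthly_cost(gb_per_tier: Dict[str, int], price_per_gb: Dict[str, int]) -> int:
    """Total cost when each tier stores the given number of GB."""
    return sum(gb * price_per_gb[tier] for tier, gb in gb_per_tier.items())


# Hypothetical volume: 1,000 GB of telemetry in a month, all sent to the top tier.
everything_in_top = monthly_cost({"top": 1000}, PRICE_PER_GB)

# Tiered and nested: 100 GB is high value, 400 GB (including those 100) goes
# to the middle tier, and all 1,000 GB is kept in the bottom tier.
tiered = monthly_cost({"top": 100, "middle": 400, "bottom": 1000}, PRICE_PER_GB)

savings = 1 - tiered / everything_in_top
print(f"all in top tier: {everything_in_top}, tiered: {tiered}, savings: {savings:.0%}")
```

With these placeholder numbers, keeping the complete data set in the bottom tier contributes 2,000 units to a 42,000-unit total, which is why duplicating everything into low-cost storage changes the bill very little.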

It all starts with Pipelines!

Telemetry Pipelines are the foundation for sustainably managing the chaos of telemetry, and they are crucial in any attempt to wrangle growing volumes of logs, metrics, traces, and events. As large enterprises continue to shift left and adopt more proactive approaches to observability and security, Telemetry Pipelines and data tiering are becoming essential to this transformation. By using a tiered data management strategy, organizations can optimize costs, improve operational efficiency, and detect and resolve issues earlier in the life cycle.

One additional advantage that this article didn’t focus on, but that is important to call out in any discussion of modern Telemetry Pipelines, is their end-to-end support for OpenTelemetry (OTel), which is increasingly becoming the industry standard for telemetry data collection and instrumentation. With OTel support built in, these pipelines integrate with diverse environments, enabling observability and security teams to collect, process, and route telemetry data from any source. This compatibility, combined with the flexibility of data tiering, allows enterprises to achieve unified, scalable, and cost-efficient observability and security.

To learn more about Kubernetes and the cloud native ecosystem, join us at KubeCon + CloudNativeCon North America, in Salt Lake City, Utah, on November 12-15, 2024.