¶ intro
Hello everyone and welcome to another episode of OpenObservability Talks. I'm your host, Dotan Horovits, and here at OpenObservability Talks we talk about anything DevOps, observability, and open source. So, as I always say, may the open source be with you. OpenTelemetry has been gaining wide adoption, and many use it in production. We discussed it at length in the last episode I had here with Juraci.
But as organizations start using OTel at scale, with multiple collectors across servers, containers, edge environments, and whatnot, a new challenge arises: how do you remotely manage, configure, and update this fleet of collectors in a consistent and secure manner? This is where the Open Agent Management Protocol, or OpAMP as we call it, comes into the picture. It provides a standardized protocol that lets a central backend automatically configure agents, push updates, monitor their health, and collect status information. This makes large-scale observability deployments much easier to operate and control. And that's going to be the topic of today's episode. So I invited Andy Keller, OpAMP maintainer and principal engineer at Bindplane, to join me on this episode to tell us about OpAMP: use cases, architecture, project status, and even some hot updates right off of KubeCon Usrail. So stay tuned, this is going to be interesting. And hey, Andy.
Hi.
Great to have you here on the show. So I tried saying very briefly what OpAMP is, but let's start even before OpAMP, with the challenge. What are we trying to solve? What has been bothering us that made us come up with OpAMP?
¶ why OpAMP mission statement
Sure. So before OpAMP, we had probably developed three or four agent management protocols in-house. Some were HTTP-based with long polling. We used WebSockets. We used protobufs, we used JSON. We had lots of different strategies for remote configuration, for getting the health status, and for other ways to manage agents that were deployed. And when there was some interest in the OpenTelemetry community in developing a new protocol for agent management, it seemed obvious that we should migrate to using that. So we took a lot of what we had learned throughout that process, many years of different custom protocols. Part of the reason there were different ones, by the way, is that we had an agent written in Java and an agent written in Go, and each time we tweaked the protocol a little bit. But when we started using the OpenTelemetry Collector as the basis for our collectors, it made sense to use an open protocol that was being developed at the time. I got wind of it via our engagement with OpenTelemetry. If you're not familiar with our involvement with OpenTelemetry, we've been contributors for a long time. Probably the biggest thing we contributed was Stanza, the log processing engine for logs in OpenTelemetry. But we've contributed many receivers and processors and other components, and we've been involved in the project a lot for a while. So when I heard that other people were interested in an agent management protocol, it was pretty natural to get involved from the beginning and review the initial specifications. Tigran from Splunk was a huge contributor at the very beginning of it, and is obviously still a maintainer and still very engaged.
But we already ran ahead to the solution and the protocol, and I want to go back to the essence: the problem of scaling out and operating large fleets.
¶ types of OpenTelemetry Collector fleet deployments
Tell us a bit about what it takes. What does a typical organization that deploys OpenTelemetry at scale, I don't know, thousands of nodes, look like? Give us a sense of what a typical deployment at this scale looks like and how many collectors you would end up with.
Sure. What's interesting is that there is no typical. Many of your viewers are probably familiar with this, but we see anything from a couple of massive OpenTelemetry gateways, where really what you're doing is managing the configuration of a gateway that sits somewhere between your edge and your backend. But then we also see people deploying collectors to embedded devices. We have collectors in point-of-sale machines. We have collectors on laptops, collecting Windows events for security tracking. The use cases are many. But for all of them, a consistent part of the problem is: how do you know what that collector is doing? How do you change what it's doing if it's not doing what you'd like it to be doing? There are a lot of tools for automating deployments of software in general: Chef, Puppet, Ansible, and things like that. There are many solutions. But what we've found is that often you want to have that thing deployed and then be able to remotely see if it's healthy and if it's configured properly, and be able to modify the configuration. Often that's because, in a lot of deployments, the teams are different. The ones responsible for deploying collectors are often different from the observability team, which is really responsible for what those collectors are doing. And so this allows you to really segment that.
Interesting. You gave an example which is very good, because people know Chef and Puppet and other configuration management tools.
¶ from configuration management to observability
Would you categorize that as primarily configuration management? Because there is also an aspect of observability, like monitoring the collectors and seeing how they do. Give us a sense, in the terms we know from other things in the IT industry, of what it segments into.
Yeah, it's interesting you mention observability. What we've found as OpAMP developed is that it started out really focused on configuration management, and with agent health, component health, and things like that, it's really moving into observability for your observability. Observability is so critical to operations that you need to know: is your observability actually working? Because if I'm not getting any pages in the middle of the night, it's either because everything's great or because my observability pipelines are failing. So the actual health of your telemetry pipelines becomes a big piece of that. And I think it's really that kind of real-time control and access. When I say real time, I mean we're connected via WebSocket to hundreds of thousands, millions of collectors, and when somebody wants to change the configuration, it can happen in a span of milliseconds. A lot of that other tooling typically takes a different approach. Here you're doing this over a real-time connection to the actual binary that's running, which could even just dynamically update, in theory. We'll talk about that when we talk about the roadmap. But that's really the difference: to really separate the concerns of deploying the collector from managing what it does.
Yeah, makes sense. A control plane for the collectors, I guess. We talked about configuration management as the core, but then again adding the health monitoring, and maybe the entire life cycle, right?
And updates, exactly. So there's the packages component, where the server can specify what new packages are available and the collector can bring them down. And that includes the collector binary itself. One thing you'll notice about OpAMP is that, yes, it's part of OpenTelemetry, and a lot of the capabilities seem pretty aligned with OpenTelemetry, but it's not intended to be specific to the OpenTelemetry Collector. So the packages-available capability, for example, really could be used for a collector that supported plugins or dynamic downloading of components, separate from the binary of the collector. But definitely also the collector.
And we should also mention Kubernetes, right? The OpAMP Bridge is another example, just to show it's not just the collector, right?
Yeah, absolutely. So the bridge speaks OpAMP, and it's actually
¶ OpAMP for Kubernetes
pretty different in the way that works. Rather than communicating with collectors, you're communicating with this OpAMP Bridge, and the OpAMP Bridge is communicating within the cluster with the OpenTelemetry Operator. That operator reads CRDs and deploys collectors. So it could be daemon sets, deployments, or sidecars: different ways of deploying the collector within the cluster. Those CRDs are something you could just apply to Kubernetes, or the OpAMP Bridge allows you to apply those CRDs remotely. So you could deploy a new collector to a cluster if you have the operator and the bridge installed.
I could put a collector in my cluster, or I could tweak its configuration, and that would effectively change the CRD, and then it's up to Kubernetes and the operator at that point to manage the life cycle.
And going back to the collector, I'm wondering about the techy side of things. When we as the community built the OTel Collector, we didn't build it with a full control plane architecture and design elements in mind. So I'm wondering, when you came to implement OpAMP in that regard, things such as hot reload, how do you get it to run? Give us a sense of what it takes to turn a single collector entity into a fleet in this respect.
Yeah, we could talk through the history a little bit, because it's pretty interesting. I mentioned we got involved really early with the specification and the implementation of the library, opamp-go, and our initial use for it was really to connect our distribution to a management server. So we took our distribution, we modified some of the bootstrapping, and made it so that our distribution would connect to the server. The way we did it is kind of the way we did it with other collectors we had implemented in the past. I mentioned we had a Java collector and a Go agent that we built. It was at the bootstrapping level: connect to the server, and when you see a change, basically just reload the configuration dynamically. The way we did that in the OpenTelemetry Collector is by sitting in front of the service. The normal command main would launch the OpenTelemetry service. We sat in front of that. We launched it, and we actually shut it down and then relaunched it when we saw configuration changes. What was interesting is that for the first year it was kind of whack-a-mole: oh, the Prometheus port was opened and never got closed on shutdown, because nobody really intended anything to happen after shutdown. Shutdown was, you know, it's nice to finish the final job, but really we're going to shut down this process anyway. So we had to be really deliberate about cleaning things up and closing ports that were opened. We just went through this process of figuring out all of the different issues there were with hot reload like that. When I say hot reload, it's really shutting down the pipeline and draining. You shut down the receivers and let everything flow through, the same way it does a clean shutdown now, but it's waiting for that and then building a new pipeline afterward.
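As an editorial illustration of the shutdown-then-rebuild reload described above, here is a minimal Python sketch. It is not the collector's code: `Pipeline`, `reload`, and the simulated `OPEN_PORTS` table are hypothetical names invented for this example, and no real sockets are used. It shows why shutdown has to release ports and drain pending data before a new pipeline is built in the same process.

```python
# Simulated OS port table (an assumption for this sketch; no real sockets).
OPEN_PORTS = set()


class Pipeline:
    """Toy stand-in for a telemetry pipeline that owns one listening port."""

    def __init__(self, config):
        self.config = config
        self.running = False
        self.pending = []   # received but not yet exported
        self.exported = []

    def start(self):
        port = self.config["port"]
        if port in OPEN_PORTS:
            # This is the whack-a-mole failure: a port left open by a
            # previous pipeline that never cleaned up on shutdown.
            raise OSError(f"port {port} already in use")
        OPEN_PORTS.add(port)
        self.running = True

    def receive(self, item):
        if not self.running:
            raise RuntimeError("receiver is shut down")
        self.pending.append(item)

    def shutdown(self):
        self.running = False                     # 1. stop the receivers
        while self.pending:                      # 2. drain what was accepted
            self.exported.append(self.pending.pop(0))
        OPEN_PORTS.discard(self.config["port"])  # 3. release the port


def reload(current, new_config):
    """Shut the old pipeline down cleanly, then build and start a new one."""
    current.shutdown()
    replacement = Pipeline(new_config)
    replacement.start()
    return replacement
```

If `shutdown` skipped step 3, the second `start` on the same port would raise, which is the kind of issue that only appears once something is expected to run after shutdown.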
So that was, I think, probably the first real commercial use of OpAMP in the project. At the time it was something that we were interested in, and Tigran was interested in, and some other people, and over time there are more companies involved in this. But we just put it in our distro. Then there was some work to build the OpAMP Bridge for Kubernetes. That was probably the second use of OpAMP, but the first one that was really part of the OpenTelemetry project. And it really wasn't until probably twelve to eighteen months ago that we had the architecture with the OpAMP extension and the OpAMP supervisor. These are in the contrib repo for the OpenTelemetry Collector and are really the way things will be done in OpenTelemetry. I say the way they'll be done. We'll talk about the roadmap some more, but there are other opinions about how this could be done, and there may be other implementations as well. But that's really the idea of how this works in the collector.
So let's really delve a bit into that. First of all, there is the protocol that you mentioned, and there's the implementation element.
¶ OpAMP protocol and components
Give us this distinction between what's OpAMP as the protocol and what's the implementation. And you threw out a few names of components, so maybe a bit about the component architecture, so that people can start getting a sense of the project.
Sure. So these are both on GitHub under the OpenTelemetry project. There's the opamp-spec repo, and that really just is the specification.md file. It is a description of the protocol and all of its capabilities. There are really just two messages: a server-to-agent message and an agent-to-server message, and it describes all the components of those messages. There's protobuf for those and, like I said, it's a description of the specification, with no implementation. It could be implemented in any language that supports protobufs.
Yeah, once you have protobuf you can generate your own SDKs, your own clients. It's open-ended.
Sure. So then the related repo is opamp-go, and that's the Go reference implementation of the specification. It includes both a server package and a client package. The client package can be used in the collector, and it is used in the collector. It's also used in the supervisor, and I'll talk about what those actually are. So there's a client implementation, and it actually does a decent amount of work in terms of state. When you send down a configuration, it determines if that configuration is different, and it has a callback mechanism where it reaches out to the collector and says, we've got a new config. Then the collector responds, and the client constructs the appropriate response messages. I wouldn't say it's a heavy client, but it does a decent amount. The server implementation is a little thinner. It gives you some callbacks when you receive messages over WebSocket or via HTTP. But they're pretty simple: on connected, on message, on connection closed.
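To make the client-side behavior described above concrete, here is a minimal Python sketch of the idea: detect whether a received configuration differs from the last applied one, call back into the agent only when it does, and build a status reply. This is not the opamp-go API. `ConfigClient`, `handle_remote_config`, and the `"UNCHANGED"` status are hypothetical names for this example, and the hash is computed locally purely for illustration.

```python
import hashlib
from typing import Callable, Optional


class ConfigClient:
    """Toy client that tracks the last applied config by its hash."""

    def __init__(self, on_new_config: Callable[[bytes], bool]):
        # on_new_config is the callback into the agent; it returns True
        # if the agent accepted and applied the configuration.
        self._on_new_config = on_new_config
        self._last_hash: Optional[str] = None

    def handle_remote_config(self, body: bytes) -> dict:
        digest = hashlib.sha256(body).hexdigest()
        if digest == self._last_hash:
            # Same config as before: do not bother the agent again.
            return {"status": "UNCHANGED", "hash": digest}
        applied = self._on_new_config(body)
        if applied:
            self._last_hash = digest
        # Construct the reply that would be reported back to the server.
        return {"status": "APPLIED" if applied else "FAILED", "hash": digest}
```

The design point is that the change detection and reply construction live in the client library, so the agent only has to implement the callback.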
I think there's an on-error callback. It's not like there's a whole lot going on. It's a little bit of an abstraction over a raw WebSocket connection. So those both live in the opamp-go package, and that package is used in the OpAMP extension and the OpAMP supervisor. The extension is, as the name would suggest, an extension: a component for the collector. And it implements the read-only side of the protocol. So it will report the current configuration and it'll report health status, but it doesn't accept any changes to configuration or to packages or anything like that. It's strictly read-only. That component can be configured as part of the collector, and as part of that configuration you say what remote server you're talking to. With just that extension, you could have a management server that really is more of an observability platform, because you can see the configuration, you can see the health, you can see that it's connected to the platform, things like that, but you can't really change anything. The supervisor, then, is the thing that implements read and write. The architecture is pretty interesting in that it uses the opamp-go library as well, and it has both an OpAMP server and an OpAMP client. The OpAMP server communicates with the extension in the collector. So it sits between the management platform and the collector. It speaks to the collector on behalf of the management platform, and because it's a client to the management platform, it can accept changes. It is a separate binary that will actually write out a new config to disk, shut down the collector, and restart the collector with the new configuration. It does some other things, like making sure the collector starts. If it doesn't start, it will revert the config and run with the last known good config, so that we're not breaking your telemetry pipelines remotely. Part of this is to be resilient: to be able to manage this remotely, you want the collector still running. So underneath the hood it's ultimately a restart. It loads a configuration and restarts. There's no real hot reload in the process. Again, there are other ways to do this, and we can talk about those things, but that's the current architecture.
And as we said, this is separate from the spec. So the spec defines the protocol, essentially. People know OTLP as another protocol under OTel. So this is the pure protocol, and then there's opamp-go and these additional components, all of these of course under the OpenTelemetry GitHub.
Exactly. And the bridge in Kubernetes, which you mentioned, the OpAMP Bridge, obviously also uses the opamp-go library.
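The supervisor behavior described above (write the new config to disk, restart the collector, and fall back to the last known good config if it does not come up) can be sketched in a few lines of Python. This is an editorial illustration under stated assumptions, not the supervisor's implementation: `apply_config` is a hypothetical name, and `restart` is a hypothetical callable that stops the collector, starts it with the given config file, and returns whether it came up healthy.

```python
import os
from typing import Callable


def apply_config(path: str, new_config: str,
                 restart: Callable[[str], bool]) -> bool:
    """Apply new_config at path; revert to the previous file on failure.

    Returns True if the collector is running with the new config, False if
    the new config was rejected (the previous config, if any, is restored).
    """
    previous = None
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            previous = f.read()          # remember the last known good config

    with open(path, "w", encoding="utf-8") as f:
        f.write(new_config)

    if restart(path):
        return True                      # collector came up with the new config

    if previous is not None:
        with open(path, "w", encoding="utf-8") as f:
            f.write(previous)            # revert to the last known good config
        restart(path)                    # keep the telemetry pipeline running
    return False
```

Keeping this logic in a separate process means a bad configuration pushed remotely cannot leave the host without a running collector, as long as an earlier good config exists.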
So the library is used pretty heavily within anything in OpenTelemetry that's using these components. One thing I want to say about this supervisor and extension architecture, one thing that's really nice about it, is that it doesn't really disrupt the collector core in any way. It's a separate process, right? It runs entirely separately from the collector. And it relies on an extension, which is just a component that the collector can optionally use. So, for better or worse, because there are other cool things we might be able to do, it's really not digging into the internals of the collector or the internals of the pipeline. It's operating at a higher level, which allows all of this capability to exist with a collector that didn't really require any changes to make this possible. So that's pretty powerful. And if we go back to the beginning of that history, when we were originally working on this there was some interest from the community, like, oh, that's cool, but not a lot of adoption. Now lots of people are implementing OpAMP in their telemetry backends so that they can remotely configure agents and collectors, and there's a lot of enthusiasm in the community for it. That's why we're seeing more adoption in the OpenTelemetry upstream. And I also think that's why we'll see some other things in the next year or two, where we'll really dig in a little bit more and unlock some more capability.
Nice. And maybe a word about what it looks like in terms of deployment architecture. As you said, there's a separate process that sits alongside the collector, but how do you deploy it? Do you deploy them together? What's the relation? What are the different options that we have there?
Yeah, so you would deploy the supervisor and the collector together, kind of one to one.
The supervisor will automatically configure the collector to speak to its server. I mentioned the supervisor has a server. The collector extension is an OpAMP client, and so when the supervisor receives configuration, it will add in the configuration necessary for the OpAMP extension to communicate with the supervisor. So typically you would deploy that, maybe with some base configuration that you want to start with, because that will be reported up as the effective configuration for the collector. But then you would start changing it. There are a couple of different ways to package it. If you're using standard package management tools in the operating system, you'd usually deploy the supervisor separately, and in its configuration you need to point to the binary of the collector. Normally those things would be in standard locations, but there are different ways of doing that. Then in Kubernetes, when you're really in a containerized environment, that's where the bridge comes in, with the operator really being in charge of deploying. And there it's pretty different because, like I said, the operator is capable of deploying: you create the CRDs and that will deploy the collectors. And the supervisor's not involved.
Okay. But anyway, the option of the Kubernetes operator is important to know. And in terms of the models, whether it's deployed as a daemon set, as sidecars, all the different options that you have there?
Yeah, those are all options in the CRD. A CRD is just a custom resource definition. There's one just called collector, and in the spec it has the config, but it also has things like a mode, which indicates how that should be deployed. That's where you say if it's a daemon set or a deployment. And then all the normal Kubernetes things you would expect: replicas, autoscaling, image pull policies, and credentials.
Yeah. And then you mentioned opamp-go as the reference implementation. I'm curious.
¶ OpAMP for remote management of OTel Java SDK and other agents
Are there any other interesting implementations of the protocol, or is that it?
So, yeah. Trying to think back, about three years ago there was a Java implementation, and that got resurrected. I don't know if it got fully resurrected or rewritten, but that was about a year, maybe a year and a half ago, as part of the Java SDK. I don't know exactly the maturity of it, but the Java SDK, as I understand it, can right now speak OpAMP and receive remote configuration. There are limits on what can be configured. And you're in a different situation when you're in an SDK, because we're not going to shut down your applications to reconfigure them. So the SDK has to really do the management of the configuration, and it has to be hot reloading. I don't know a lot about it, but if somebody's interested, look into the Java SDK.
The Java SIG, I think, would be the perfect place for folks. The Java folks out there, if you're interested, the Java SIG would be the right place for you to join in the conversation for this and other things related to the Java SDK.
It's about the same thing. I think we'll see that continue in the future to other SDKs. Some use cases are somewhat trivial, but let's just take sampling, for example. You might just want to change your sampling rate. Or you might be trying to investigate an issue and you want to turn on debugging. It's pretty powerful to just be able to do that.
And across the cluster, again, doing it at scale. If yours is a microservice application with multiple tiers and data stores and whatnot in your architecture, you need to do that across the whole system, across the tiers. It's very powerful to do that at the click of a button. And I also think it shows something, going back to a note that you made briefly before, which maybe now helps people understand it better: the protocol is larger than just the collector. The collector is the primary implementation, but the protocol is meant to be more generalized than that, so that it can serve other types of agents. This is why it's called agent management rather than OTel Collector management.
Yeah, absolutely.
Agents could also be SDKs, and potentially I think the protocol can definitely be generalized to other types of agents as well, right?
Yeah. As I mentioned previously, with the messages being kind of generic, we're not opinionated on the format of the configuration. We know what an OpenTelemetry Collector configuration looks like. An SDK configuration looks different. So just within the OpenTelemetry community you have different configurations. But then you might have other agents that have other configuration styles, and you still want to be able to support them. So the remote config message that's defined in OpAMP is really just a map of name-value pairs where that value can be anything. Then it's really up to the agent to see, okay, I've got a new value, what is this? The bridge sees a value and that value is a CRD. The supervisor sees the value and the value is a collector configuration. I think we would love to see other agents in the marketplace, whether they're security agents or other observability agents, implement OpAMP. It's definitely not intended to be an OpenTelemetry-specific protocol. It just came out of some interest in this community, and it was a convenient place to house the project, but it's really intended to be agnostic of the agent.
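The "map of name-value pairs where the value can be anything" idea can be sketched as follows. This Python example is only loosely modelled on the OpAMP remote config message: the `ConfigFile` dataclass, the two handler functions, and the entry names (`collector.yaml`, `sdk`, `sampling_ratio`) are hypothetical and chosen for illustration. The point is that the same message shape is interpreted differently by each kind of agent.

```python
import json
from dataclasses import dataclass
from typing import Dict


@dataclass
class ConfigFile:
    """One named entry: opaque bytes plus an optional content type."""
    body: bytes
    content_type: str = ""


def supervisor_style_agent(config_map: Dict[str, ConfigFile]) -> Dict[str, str]:
    # Treats every value as a text config file to be written to disk.
    return {name: entry.body.decode("utf-8") for name, entry in config_map.items()}


def sdk_style_agent(config_map: Dict[str, ConfigFile]) -> float:
    # Treats the "sdk" entry as JSON settings and picks out a sampling ratio,
    # which it could apply in-process without restarting the application.
    settings = json.loads(config_map["sdk"].body)
    return float(settings["sampling_ratio"])


collector_message = {
    "collector.yaml": ConfigFile(b"receivers: {}\n", "text/yaml"),
}
sdk_message = {
    "sdk": ConfigFile(b'{"sampling_ratio": 0.25}', "application/json"),
}
```

Because the protocol only carries named opaque values, adding a new kind of agent means writing a new interpreter for the values, not changing the message.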
And it's interesting, because I think keeping it generic enough is not just about a generic payload, because the sense of configuration changes between components, let's say between an SDK and a collector. The approach for loading it, technically, is also different, right? You talked about the supervisor doing this restarting or shutdown. Obviously this is much less applicable for an SDK. So also in terms of the mechanisms being applied: in order to support a variety of types of agents, you also need to support a variety of mechanisms, right?
Exactly.
So today, for example, for the SDK it works differently. What you described about the supervisor is probably not the way that you do it with the SDK, right? Just to make sure, for those worried SDK developers out there, the mechanism behind it is different.
Exactly. And I guess one thing I should mention is that some of the auto-instrumentation support available in the Kubernetes operator is really powerful. It allows you to implement observability for your applications in a really lightweight, non-intrusive way. And that can also be configured remotely via the OpAMP Bridge. So while we'll see SDKs themselves start to speak OpAMP more, there's already this ability to inject auto-instrumentation, which is powerful.
Yeah, that's nice, and also good to know. So we have OBI, essentially, and today it has the interoperability with OpAMP?
Well, the Kubernetes operator is capable of doing that via the bridge.
Okay, makes sense. And another question: we talked about the server side, and I'm wondering, are there different
¶ server implementations for OpAMP
server implementations out there? What can we see? Obviously you have one, but just to understand what people can find out there when they look up OpAMP implementations.
Yeah, I think there are a couple of open source projects that are pretty small and not really full featured, but I think there's some interest. It comes up in the OpAMP SIG quite often: some interest in developing a reference implementation of the server. I would almost compare it to OpenTelemetry itself. If you think about OpenTelemetry as a collector of telemetry, and also a way of implementing pipelines for telemetry, there's not a backend as part of the project. There are lots of open source backends, but this is the way we send telemetry data to other things. I really see the OpAMP project as a way we define how agents can be managed, but it isn't part of the project to manage them, if that makes sense. So I think we'll probably see some of those servers developed. I do know that there are commercial backends that speak OpAMP and are implementing that within their platforms. I think we might be the only vendor-neutral one. Bindplane is kind of unique in that we really manage telemetry pipelines and send telemetry to any backend. You might send some to one company, some to another company, and some to some other storage or something like that. But there is a very small server example in the opamp-go repo. It's really just to give you an idea of how this works and to explain how you would implement a server. It looks really simple on the surface, because it defines a contract between one server and one agent and how I would deploy a configuration to it. The protocol isn't opinionated about how you deploy that configuration if you have a hundred thousand agents. Do you do one at a time? Do you do ten at a time? Do you do one, then ten, then a hundred, and grow exponentially until you've deployed it to all of your collectors? I think there's a lot of nuance in how to do that effectively at scale.
So how do you see people, and I'm not talking specifically about customers now, but in general, when you see people coming to the SIG and showing interest as end users?
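The staged rollout question raised above (one agent, then ten, then a hundred, growing until the whole fleet is covered) is a server-side policy that the protocol leaves open. One possible policy is sketched below in Python as an editorial illustration. `rollout_batches`, `staged_rollout`, and the `deploy` callable are hypothetical names, and stopping at the first batch that reports a failure is just one possible gate.

```python
from typing import Callable, List, Sequence


def rollout_batches(agents: Sequence[str], first: int = 1,
                    factor: int = 10) -> List[List[str]]:
    """Split agents into batches of size first, first*factor, first*factor^2, ..."""
    batches = []
    start, size = 0, first
    while start < len(agents):
        batches.append(list(agents[start:start + size]))
        start += size
        size *= factor
    return batches


def staged_rollout(agents: Sequence[str], deploy: Callable[[str], bool],
                   first: int = 1, factor: int = 10) -> List[str]:
    """Deploy batch by batch; stop after the first batch with any failure.

    deploy(agent) is assumed to push the config to one agent and return
    True if the agent reported it as applied. Returns the agents attempted.
    """
    attempted = []
    for batch in rollout_batches(agents, first, factor):
        results = [deploy(agent) for agent in batch]
        attempted.extend(batch)
        if not all(results):
            break   # halt the rollout before touching the larger batches
    return attempted
```

With 1,000 agents and the defaults, the batches have sizes 1, 10, 100, and 889, so a bad configuration is most likely caught while only a handful of agents have received it.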
¶ adopter vendors and end-users
How do they consume it? What do they do? How do they get started? They get excited about the fact that there is standardization around this, but what's the next step? Take us through the first steps for people starting to play around with it and trying to implement it.
Yeah, I mean, to be honest, at the moment what we're seeing mostly is vendors in the SIG meetings. Elastic is there, and Dynatrace and Google and Splunk and Honeycomb, and I'm probably leaving lots of people out.
An impressive list, and it's great to see that backing from so many significant vendors.
Yeah, but in terms of end-user usage, I would say it's defining a capability for collector management, for agent management. For an end user, it's just going to be a capability that they see in the platform they're using. They aren't really thinking about adopting and implementing it themselves. Now, one fascinating caveat, maybe that's not the right word, but at a KubeCon, Nike did a presentation. You can find it on YouTube. They implemented their own OpAMP backend. I thought that was pretty cool. So yeah, they read the spec.
If you're a big enough end user, you also do that yourself.
And I have seen that. We've talked to companies who have very special use cases. We have some companies that are using our platform, for example, really as just an API layer. They've implemented their own management UI and experience, but they're doing the management via our product. And the reason we're sitting in the middle is because of a lot of those things I mentioned: doing this at scale, supporting rollouts in different ways, and helping with the configuration process. There's a lot you can do there.
I also want to ask about the state of the project. What's its maturity level on the CNCF scale?
¶ project maturity
Yeah, so the spec is officially beta. And the way we decided to do it, as you'll see if you read through the spec, is per message. Now, I said there were really only two messages: a server-to-agent message and an agent-to-server message. Well, within each of those messages you have sub-messages, and each of those sub-messages has a different maturity level. Some are in development, and I would say things like config management and the agent description, which just describes some agent metadata, are beta. The opamp-go implementation is also beta. And then, I'm going to look real quick because I'm curious, but the OpAMP supervisor and extension, I would say they're fairly new.
No, it's okay, we don't have to. But in general I think it gives a sense of the maturity of the project.
Which is considered alpha at the moment.
Sounds good. And I have to say, I came back from OTel Unplugged Europe, which took place in February, just recently, and it was very interesting for me to see that OpAMP was brought up and that there was actually enough interest to merit a session. It's an unconference, so the sessions are determined by the interest expressed by the participants. So it was very interesting to see that OpAMP was brought up and drew enough attention. First of all, very encouraging. I was very happy to see that. And I think interesting points came up. I know that you weren't able to join, but I'm wondering if there's anything that drew your attention afterwards in terms of the discussion that took place.
Yeah, I wasn't able to be there. I will be at KubeCon in Amsterdam, so I'm looking forward to that. And I was in Atlanta, so that was great. I think we're seeing a lot of interest. I gave a talk last year about operating OpAMP at large scale, and that was kind of a successor talk to one I gave a couple of years ago in Chicago, which was really just an introduction. Those are both on YouTube, by the way, if anybody's interested in learning more and going through slides of what the messages all look like, etc. But I think people have really come around to the idea that this is critical at scale, with tens of thousands or hundreds of thousands of collectors. And I think they're excited about being able to observe their telemetry pipelines from a single pane of glass, as opposed to a bunch of files in some Git repo that they're expecting their CI process to automatically deploy for them. It's nice to just see things operating. What kind of questions were you hearing?
¶ protocol vs. product
A lot of that was just about putting it into practice. We talked about it before, and you said that it's part of the definition of the project: it's the protocol and some reference implementation, not something that you should download and start running.
When people want to start using it, they're looking for some sort of OpAMP server that they can use, or for a UI, all these things that are more about usability, which goes to the productization of things.
Right.
So there is some setting of expectations. Some of it is more about, I guess, getting-started guides, things that an open source project can provide, and I'm sure that will be part of the requirements for the maturation of the project on the CNCF scale. These are the things that the project will usually include or will need to enhance. So this is, I guess, where the line is drawn: on one side, enabling the getting started, with all of the documentation and samples around that, and on the other side, something that I would call productization. When people reach that point, they need to decide whether they do it themselves or go to someone who productizes it for them.
It's a great distinction, the product versus the protocol. Think about HTTP as a protocol.
Right.
There's no product there. You need a server, you need a browser.
Exactly.
So I think getting the collector speaking OpAMP has been the focus at the moment. And like I said, we'd love to see more collectors, really anything out there, especially with a supervisor. I think the supervisor is a model where, because it's just writing a configuration and restarting a binary, that binary could be anything, right? So we can easily see other collectors out there speaking OpAMP. On the backend side of things, I think there's interest, and I've heard of people looking for a backend, and I'm sure some will be built. But really, the project within OpenTelemetry is focused on the spec itself, the reference implementation, and then how we activate this within the rest of the OpenTelemetry community, with the collector and the SDKs and the operator. And I try to play both sides: I'm mostly focused on the protocol, but I'm very keen on getting it implemented, and I also want to support vendors that are looking to implement this in their platforms. But really I'm laser-focused on the protocol and its capabilities. What are we missing? What can we add to it? One thing I'm really happy with how it came out: about a year, maybe two years ago, we added a concept called custom messages, and it's really a way to extend the protocol so that you can use it to do different things. So we'll see vendors in their own distros implementing custom messages. There's actually, I think it's an S3 component, I can't remember if it's the receiver or the exporter, that speaks custom messages over OpAMP and will give you the progress of, I can't remember if it's reads or writes, to S3. Somebody implemented that within their component.
That was pretty cool.
Yeah, yeah. The extension has a hook to send and receive custom messages, so that any component within the collector can use that hook. So if you have a custom component that does this particular thing, and your server knows to send these special messages, they'll get routed through the extension to your component and routed back out. And it really opens up a lot of pretty cool possibilities.
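To make the routing idea above concrete, here is a minimal sketch of a hook that hands custom messages to whichever component registered for them. It is an illustration only, not the collector extension's actual API: the `CustomMessage` fields follow the capability/type/data shape I understand the OpAMP spec to use (treat that as an assumption and check the spec), and `CustomMessageRouter`, `register`, `dispatch`, and the `com.example.s3progress` capability name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class CustomMessage:
    """Illustrative stand-in for a custom message (field names are assumed)."""
    capability: str  # which extension of the protocol this message belongs to
    type: str        # message type within that capability
    data: bytes      # opaque payload, interpreted only by the component


Handler = Callable[[CustomMessage], Optional[CustomMessage]]


class CustomMessageRouter:
    """Hypothetical hook: components register per capability, the router dispatches."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, capability: str, handler: Handler) -> None:
        if capability in self._handlers:
            raise ValueError(f"capability already registered: {capability}")
        self._handlers[capability] = handler

    def capabilities(self) -> List[str]:
        # What an agent could advertise to the server as supported.
        return sorted(self._handlers)

    def dispatch(self, message: CustomMessage) -> Optional[CustomMessage]:
        """Route a server message to its component; return the reply, if any."""
        handler = self._handlers.get(message.capability)
        if handler is None:
            return None  # nobody registered for this capability, so drop it
        return handler(message)


def s3_progress_handler(message: CustomMessage) -> Optional[CustomMessage]:
    # Made-up component logic: answer a "status" request with a fixed payload.
    if message.type == "status":
        return CustomMessage(message.capability, "progress", b"42 objects")
    return None


if __name__ == "__main__":
    router = CustomMessageRouter()
    router.register("com.example.s3progress", s3_progress_handler)
    print(router.dispatch(CustomMessage("com.example.s3progress", "status", b"")))
```

Keying the registry on the capability string keeps the router ignorant of payload formats, so a vendor-specific component can define its own messages without any change to the hook.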
Nice, nice. And I guess talking about extending the capabilities is a good segue to looking forward.
¶ OpAMP Gateway launch
Now, as we're heading towards KubeCon Europe, you're going to have a big launch ahead of KubeCon, hot off the press: the OpAMP Gateway. Do you want to say a word?
Yeah, so one of the things I've been working on is this thing we're calling the OpAMP gateway. It's currently implemented as an extension, although it's probably going to be refactored into a library that could be deployed as a binary or something else. What it really provides is this: if you think about a gateway deployment of a collector, it's effectively relaying OTLP. You have a lot of edge collectors, and they're sending into another collector via OTLP. They're doing some processing in the gateway, and then that is sending to one or more telemetry backends. What the OpAMP gateway does is really similar, in that you might have a lot of edge collectors that want to speak OpAMP, want to be remotely configured, and want to report their health. But one of the challenges today is that if I have all of these edge collectors, and they're connected via OTLP to this gateway and sending telemetry, they might not be able to see outside the network. And what we require right now with OpAMP is that each one of these edge collectors connects to the management platform directly with a WebSocket. What the OpAMP gateway allows you to do is have it be the OpAMP server for all of the edge collectors, and then it maintains one or more upstream WebSockets. You can configure how many WebSockets you want to use based on how much traffic you have and how many OpAMP messages you expect to be going across. Let's say I've got a hundred thousand collectors across the many different clusters in my organization. Instead of all 100,000 connecting to the management platform, I can deploy a hundred OpAMP gateways, have a thousand collectors connect to each one, and then those hundred connect to the management platform. It really helps with one of the scaling challenges of the management backend, which is just maintaining a ton of WebSockets. If I have a million collectors out there, now I need a million WebSockets. So this introduces the ability to multiplex these messages through the gateway. This is something I've been working on for a few months, and it'll be in a contrib repo of our distro, and if people are interested we'll be happy to donate it upstream. But that's where it'll start out. It should be ready by the time this is announced, so I'll give you a link to it so that people can find it.
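The fan-in arithmetic from the example above (100,000 collectors, a thousand per gateway, a hundred upstream connections) is easy to check with a few lines. This is just back-of-the-envelope math under stated assumptions; `gateway_plan` and its parameters are hypothetical names and say nothing about how the actual gateway is configured.

```python
import math
from typing import Dict


def gateway_plan(
    collectors: int,
    collectors_per_gateway: int,
    upstream_sockets_per_gateway: int = 1,
) -> Dict[str, int]:
    """Compare WebSocket counts at the management platform with and without gateways.

    Assumes every collector holds one WebSocket, either directly to the
    platform or to a gateway, and each gateway holds a fixed number of
    upstream WebSockets.
    """
    if collectors < 0 or collectors_per_gateway < 1 or upstream_sockets_per_gateway < 1:
        raise ValueError("counts must be positive")
    gateways = math.ceil(collectors / collectors_per_gateway)
    return {
        "gateways": gateways,
        # Connections the management platform has to keep open with gateways.
        "upstream_sockets": gateways * upstream_sockets_per_gateway,
        # Connections it would have to keep open if every collector dialed in.
        "direct_sockets": collectors,
    }


if __name__ == "__main__":
    print(gateway_plan(100_000, 1_000))
    print(gateway_plan(1_000_000, 1_000, upstream_sockets_per_gateway=2))
```

With these numbers the platform holds a hundred connections instead of a hundred thousand; even at a million collectors with two upstream sockets per gateway, it would hold two thousand.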
Yeah, for sure. We'll put it in the show notes. And you mentioned some figures. Just to understand: we started, even with the OpAMP protocol itself, with a scaling challenge, and now you're talking about
¶ Scale of OTel Collector deployments
another order of magnitude or more of scaling. So what kinds of scale? Give us a sense of the deployments that you see out there at scale.
Millions. Millions. I mean, you start thinking about the IoT, embedded space. Some of the bigger ones we see are security use cases with workstations and things like that. We see point-of-sale machines, as I mentioned. But when you get into the embedded space, that's where you really see millions being a number that you need to start reasoning about. And so again, this gateway will not require millions of live WebSocket connections. That's really where we see some scale challenges. But WebSockets are great for low latency, and when they're dormant they just sit there. It's a great way to communicate remotely with the agents. The OpAMP protocol also allows for plain HTTP. It's very similar in terms of the protocol. The way the messages flow is really similar, but it's an HTTP message up to the server, the server can respond, and then the agent sends another message. The disadvantage is that the server can only send a response when the agent sends a message. So the server can't really push, but if the agent sends messages pretty frequently, then you get something pretty close to push with a little latency.
Yeah, and I have to say, zooming out: I started in my opening notes with the maturity of OpenTelemetry as a project. Seeing the project, with OpAMP as the enabler but also in general, deployed not just in production but in production at these scales gives a sense to those who are still debating whether the project can sustain production workloads. I think these figures that you just described show that.
We probably can't name names, but there's a good chance that when you're checking out while shopping, an OpenTelemetry Collector might be involved. It's pretty wild. It makes a ton of sense. The observability space in general has really grown over the past 10 years since I got involved in it. I think people really recognize how critical it is to understand what their workloads are doing. Gone are the days of a black box with a blinking LED on it. We need to see.
Yeah, yeah. That's why I called my podcast Open Observability Talks. It's all about open observability: not black boxes and not proprietary. That's the way to go. And maybe as an important note towards the end, I want to ask about the roadmap. So this is a major release, the OpAMP gateway, as you explained very nicely, and it's coming your way. The link will probably already be in the show notes so people can try it out. Beyond that, what's on the roadmap for OpAMP?
Oh boy, there are so many things. So, I mentioned probably some more SDKs
¶ OpAMP roadmap
adopting OpAMP. We'll see how that goes. There are a lot of discussions in that space. I think there are different flavors of hot reload and partial reloading. Actually, at the OpAMP SIG meeting today, two hours ago as we're recording this, there were two different proposals for different ways of configuring agents. One was really about a partial reload of pipelines. Say we have ten pipelines in our collector and I only change the configuration of one. Can I leave the other nine running and just change that one pipeline? That would really be a change within the core collector, but it would then naturally lead to OpAMP managing that change. Speaking of change, I think one of the things we'll see is some kind of diff support for configuration. Right now, the way OpAMP works is that you send down a complete configuration, and we've seen configurations get extremely large. It gets to the point where you don't really want to send the entire configuration down when maybe only a sampling rate changes. I think that also speaks to partial reloading. And then hot reloading is something where, depending on who you ask and when, there's been varying interest in the community. The collector does respond to a SIGHUP now and reloads the configuration, which is pretty cool. And so there's been some interest as well in taking an approach like we had in our collector, of actually embedding OpAMP in the collector and not requiring a supervisor, and doing OpAMP with that sort of hot reload. There's also a really cool OTEP in the works right now. The title of it is telemetry policy. Policies are intended to be slightly different from configuration. So a policy might be: when I see this kind of log message, filter it out, or something like that. And how that actually results in configuration is kind of up to the implementation, I guess I would say. So you communicate policy, and maybe the SDK would implement that policy one way and the collector would implement it a different way, but the policy itself becomes the vehicle for communicating the intent. And naturally, you'll want there to be a way to deploy those policies. So really anything that's configuration-adjacent and lifecycle-adjacent. And let's see, there's also an effort, something we talked about at length today, around what our roadmap to stability looks like: getting the protocol stable and the reference implementation stable. And obviously this is happening throughout the OpenTelemetry community with the collector itself and different components.
Yeah, for sure.
And getting the extension stable, and the supervisor, and the OpAMP bridge, so there's lots of different work there.
Great, great. So I think, as we're coming up on time, maybe for the last note: where can people follow the project, get involved, catch up on the roadmap, chime in, try it out?
¶ where to follow the project and Andy
So really a great place to start is the CNCF Slack. As most of your viewers probably know, OpenTelemetry is the second-largest project in the CNCF.
Yeah. What's the name of the main channel? Give us just the names of the channels.
So simple enough. And for those who are not yet there, just know it's open to everyone. You don't need to be a paying member or anything like that. If you go to cncf.io and look for the Slack invite, it's very simple to join the instance. Then you open your profile and look for the interesting channels, in this case otel-opamp.
And also, where can people follow you? Obviously you're there on the Slack, but other than that, how can people reach out to you after the episode?
Yeah, so my GitHub handle and my Twitter handle are my name with no E at the end. I thought that was really awesome once upon a time, partly because Andy Keller was taken on Flickr and Tumblr and all the cool internet things, and unfortunately people put the E in all the time. So if you want to find me, don't put that last E in. And I would encourage you to go check out the repos as well: opamp-spec and opamp-go, both in the open-telemetry GitHub organization. Open issues, start discussions. We certainly welcome everyone, and we had a good turnout at our SIG meeting today. Our SIG meets every two weeks, Tuesday at noon Eastern. So it's maybe a little inconvenient for my lunchtime, but it's very convenient for Europe and the West Coast.
Sounds good. And maybe a last note, since we mentioned it briefly: we're coming up on KubeCon Europe, taking place in Amsterdam, March 23rd through 26th. So that's a great place to see Andy. I'll also be there if you want to see me as well. It's definitely a good place to catch up with the OpAMP folks and with other folks from OpenTelemetry. There's actually an Observatory booth, a mega booth, that you can find specifically for everything OTel. So hang out there and you'll find everyone who's involved in anything OTel. And again, since we're at Open Observability Talks: amongst the different co-located events on the first day, or day zero as they call it, there's also Observability Day. I'll be there, and I'll be on stage as well.
My favorite day, highly recommend that.
So highly recommend. It's a good warm-up to the main KubeCon. So come a day earlier and join the co-located event, specifically Observability Day. Again, a great place to hang out with folks who are observability professionals. And with that, I'd like to thank you again, Andy, for joining me on this episode.
Yeah, thanks for having me.
¶ outro
Yeah, it was great. And thank you, of course, to all our viewers and listeners for joining us on today's episode. As always, all the episodes are available on your favorite podcast app or on YouTube, so you can check them out. And of course, we'll put the references for Andy and for the show in the show notes. You can follow the show on Bluesky, on LinkedIn, and on X at OpenObserv to catch up on the episodes, carry on the conversation, ask your questions, and share your feedback. See you on the next episode. Until then, may the open source be with you.
