Traefik API Gateway for Microservices: With Java and Python Microservices Deployed in Kubernetes

Speaker 1

00:00

Welcome back to the deep dive. Today we are wrestling with probably the defining architectural challenge of the last few years, maybe the decade.

Speaker 2

00:09

Definitely feels like it.

Speaker 1

00:11

How do you reliably route traffic when the ground beneath your feet is constantly shifting? We're talking about the shift from stable, predictable monoliths.

Speaker 2

00:19

Right, the quarterly release cycle kind of thing.

Speaker 1

00:22

Exactly. To this, well, sometimes chaotic world of microservices that scale up and down constantly.

Speaker 2

00:28

Yeah, it's like comparing, I don't know, a printed map from the nineties to Google Maps in a city where roads just appear and disappear every few minutes and buildings resize themselves.

Speaker 1

00:39

That's a great analogy. And those older load balancers, the ones built for the static map, they just can't cope.

Speaker 2

00:45

They're stuck in that static config mindset. They pretty much melt down when faced with how dynamic a modern cloud environment really is.

Speaker 1

00:52

And that operational headache, that's why we're diving deep into Traefik today. It's an open source API gateway, and it's built specifically to handle that dynamic complexity.

Speaker 2

01:02

Right. The idea is to simplify deploying microservices, especially if you're in the Kubernetes world, which, let's face it, many are.

Speaker 1

01:08

So, our mission today?

Speaker 2

01:10

Our mission is to unpack how Traefik acts as this crucial link. Think of it as the intelligent gateway tier. It connects that volatile ecosystem of services to the outside world.

Speaker 1

01:23

And by the end, you, listening, should have a pretty good handle on the cutting edge of network routing.

Speaker 2

01:29

A shortcut maybe, yeah, a shortcut to being well informed about this stuff. Resilience patterns too.

Speaker 1

01:34

Okay, let's start with the big picture, the monolith problem. We probably all remember it, right? Tight coupling.

Speaker 2

01:39

Slow releases.

Speaker 1

01:42

Oh, the pain. And that really expensive all-or-nothing scaling. Need more horsepower for login? Scale the whole thing.

Speaker 2

01:48

Huge waste of resources. That worked okay, I guess, with the classic three-tier model: presentation, application, data. Simple enough.

Speaker 1

01:56

But microservices break that model completely.

Speaker 2

01:59

When you shatter that app into, I don't know, dozens, hundreds of tiny services, you need a different architecture.

Speaker 1

02:04

You have to evolve to the four-tier model.

Speaker 2

02:06

Exactly, build for distributed systems.

Speaker 1

02:08

And that fourth tier is where Traefik lives, right?

Speaker 2

02:11

That's its home turf. Right, so the four tiers you really need are: first, content delivery, the UI, the client stuff.

Speaker 1

02:17

Okay.

Speaker 2

02:17

Second, the gateway tier. That's Traefik: discovery, routing, correlating requests, aggregating responses sometimes. All that happens here. The traffic hub, sort of, yeah. Then third is the services tier, your actual decoupled business logic units, high cohesion, loose coupling, all that good stuff. And finally, the data tier: databases, message queues, but now ideally exclusive to the services that own that data.

Speaker 1

02:43

Okay, so the gateway tier is critical, it's the front door. What does a modern gateway like Traefik absolutely have to do to handle that chaos in tier three?

Speaker 2

02:52

Right. It's got to be more than just a simple port forwarder. Layer seven routing is non-negotiable.

Speaker 1

02:56

Layer seven meaning application layer?

Speaker 2

02:59

Exactly. Routing based on HTTP headers, host names, paths, maybe even stuff in the request body, not just layer four, like TCP or UDP ports. And it needs to speak different languages, essentially: HTTP/1, HTTP/2, gRPC, REST. It shouldn't care.
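
The layer seven idea can be illustrated with a minimal Python sketch. It is not Traefik's rule engine: the `Route` type, the `pick_service` helper, and the host and service names are all hypothetical, chosen only to show routing on host name and path rather than on a port.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Route:
    host: str         # exact Host header to match
    path_prefix: str  # URL path prefix to match
    service: str      # hypothetical back end service name


def pick_service(routes: List[Route], host: str, path: str) -> Optional[str]:
    """Return the service of the most specific matching route, or None."""
    matches = [
        r for r in routes
        if r.host == host and path.startswith(r.path_prefix)
    ]
    if not matches:
        return None
    # The longest matching prefix wins, so /api/cart beats the catch-all /.
    return max(matches, key=lambda r: len(r.path_prefix)).service


# Hypothetical routing table for one public host name.
ROUTES = [
    Route("shop.example.com", "/", "web-ui"),
    Route("shop.example.com", "/api/cart", "cart-service"),
]
```

A layer four forwarder would see only the destination port, so both requests above would look identical to it.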

Speaker 1

03:12

And security? That feels like a huge piece, especially with all those services chattering away behind the gateway.

Speaker 2

03:18

Oh, massive. Absolutely. The gateway must handle TLS termination, you know, decrypting the incoming public traffic.

Speaker 1

03:25

Standard stuff, right?

Speaker 2

03:27

But then inside the cluster, for service-to-service chat, you need mutual TLS, mTLS.

Speaker 1

03:33

So both sides prove who they are.

Speaker 2

03:36

Precisely. It's not just the client showing ID. The server demands ID back: show me your papers too. It's essential for locking things down inside your perimeter. If something goes wrong, it limits the blast radius.

Speaker 1

03:47

Okay, that makes sense, which leads us right to maybe the killer feature: autoconfiguration. Because, like you said, hundreds of services, maybe thousands of instances. Updating config files by hand? Impossible, right?

Speaker 2

03:59

It just doesn't scale. That's where Traefik fundamentally solves the service discovery problem. Instead of a human editing a file...

Speaker 1

04:05

Which is always error prone.

Speaker 2

04:07

Always. Instead, Traefik talks directly to a service registry. Think Consul, etcd, or Kubernetes itself. These things are like near-real-time databases of where every active service instance lives on the network.

Speaker 1

04:19

Ah, so Traefik doesn't need its own map. It just asks the map maker constantly.

Speaker 2

04:23

Exactly, perfect analogy. Traefik calls these map makers providers. It has first-class support baked in. It just sits there and watches the provider. A new service instance spins up, Traefik sees it. Yep. An old one dies, Traefik sees that too, and it automatically reconfigures its own routing tables, crucially without needing a restart or dropping existing connections.
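
A rough Python sketch of the watch-and-reconfigure idea, assuming a provider that emits simple "added" and "removed" events. The `RoutingTable` class and the event names are hypothetical; a real provider integration is far richer. The point is only that the table is rebuilt and swapped in place while the process keeps running.

```python
import threading


class RoutingTable:
    """In-memory map of service name to live instance addresses."""

    def __init__(self):
        self._lock = threading.Lock()
        self._backends = {}  # service name -> tuple of addresses

    def apply_event(self, kind, service, address):
        """Apply one provider event without restarting anything."""
        if kind not in ("added", "removed"):
            raise ValueError(f"unknown event kind: {kind}")
        with self._lock:
            current = set(self._backends.get(service, ()))
            if kind == "added":
                current.add(address)
            else:
                current.discard(address)
            # Build a fresh mapping and swap the reference, so readers
            # never observe a half-updated table.
            updated = dict(self._backends)
            updated[service] = tuple(sorted(current))
            self._backends = updated

    def backends(self, service):
        """Return the addresses currently known for a service."""
        return self._backends.get(service, ())
```

Requests already in flight keep using the addresses they were given, while new lookups see the updated table.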

Speaker 1

04:44

Hot reloads, zero downtime. That's the dream.

Speaker 2

04:47

That's critical. Dynamic configuration and hot reloads are absolutely key.

Speaker 1

04:52

How tricky is it if you're running, say, Docker and Kubernetes and maybe Consul? Can one Traefik instance watch all of them?

Speaker 2

05:00

Yeah, surprisingly easily. That's the beauty of the provider concept. Traefik kind of abstracts away the specific details of talking to Kubernetes versus talking to Consul, so you can centralize routing even in a mixed environment.

Speaker 1

05:13

So developers just deploy to whatever platform they use.

Speaker 2

05:15

And Traefik figures out how to find it and send traffic there. Developers focus on code. Traefik handles the routing complexity.

Speaker 1

05:22

Okay, so Traefik knows where everything is. Now let's talk about actually sending the traffic efficiently. We all know basic round robin, right? Just deal them out equally. Fine for stateless stuff.

Speaker 2

05:35

Yeah, simple, effective, if all your servers are identical. But they rarely are.

Speaker 1

05:38

So, weighted round robin, WRR. How does that work?

Speaker 2

05:42

Right. WRR is about being smarter with resources. Maybe you have an older, cheaper server with less CPU. You don't want it getting the same traffic as your brand new beefy cloud instance. Makes sense. So WRR lets you assign weights. You could say, send three requests to the powerful guest v1 group for every one request you send to the older guest v2, a three-to-one ratio, for example.
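
The three-to-one split can be sketched in a few lines of Python. This is a deliberately naive illustration, not how Traefik schedules requests internally; the function name and the `guest-v1` and `guest-v2` labels are assumptions made for the example.

```python
import itertools


def weighted_round_robin(weights):
    """Yield back end names forever, in proportion to their integer weights.

    Naive approach: repeat each name `weight` times, then cycle through
    the expanded list. Real balancers usually interleave more smoothly.
    """
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


# Three requests to guest-v1 for every one request to guest-v2.
picker = weighted_round_robin({"guest-v1": 3, "guest-v2": 1})
first_eight = list(itertools.islice(picker, 8))
```

Over any eight consecutive picks, six go to the heavier back end and two to the lighter one.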

Speaker 1

06:04

So it's not just load balancing, it's cost optimization too.

Speaker 2

06:08

Definitely. In the cloud especially, WRR helps you squeeze maximum value out of cheaper or older instances alongside the new ones. Keeps everything utilized efficiently, saves money, no resource just sitting idle or getting totally slammed.

Speaker 1

06:21

Okay, let's flip that. What about apps where the user's state matters, like a shopping cart stored in memory on one specific server instance? Round robin would break that.

Speaker 2

06:30

Yeah, that needs sticky sessions. If a user's second request hits a different server, poof, their cart is gone or they get logged out. Bad experience.

Speaker 1

06:39

So how does Traefik handle that? It uses cookies?

Speaker 2

06:41

Typically. When the first request hits a back end instance, Traefik sets a cookie in the response. For subsequent requests from that same user, Traefik reads the cookie and makes sure to send the request back to that same original instance. Keeps the session alive.
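
A minimal Python sketch of cookie-based stickiness as described here. The `StickyBalancer` class and the `backend_id` cookie name are hypothetical, and a production gateway would typically store an opaque or hashed identifier rather than the raw instance name.

```python
import itertools


class StickyBalancer:
    COOKIE = "backend_id"  # hypothetical cookie name

    def __init__(self, backends):
        self._backends = list(backends)
        self._round_robin = itertools.cycle(self._backends)

    def route(self, cookies):
        """Return (chosen back end, cookies to set on the response)."""
        pinned = cookies.get(self.COOKIE)
        if pinned in self._backends:
            # Returning user: honor the pin, nothing new to set.
            return pinned, {}
        # First visit, or the pinned instance no longer exists:
        # fall back to round robin and pin the user to the result.
        chosen = next(self._round_robin)
        return chosen, {self.COOKIE: chosen}
```

A stale cookie that names a removed instance is simply ignored, so the user is re-pinned instead of being sent to a dead server.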

Speaker 1

06:55

Okay, sticky sessions make sense. But underlying all this balancing, you need health checks, right? Making sure you're not sending traffic to a dead server.

Speaker 2

07:03

Absolutely fundamental. You only want to route traffic to instances that are actually healthy, usually meaning they return a 2xx or a 3xx HTTP status code. Anything else is an error.

Speaker 1

07:15

Doesn't constantly poking every instance add overhead, though? A performance tax?

Speaker 2

07:19

That's a fair question. It's a trade-off. Traefik does use active checks, where it sends a probe, and passive checks, watching responses. But you configure the interval. You tune it so you find a balance, right? You set it so the monitoring overhead isn't painful, but it's frequent enough to pull an unhealthy instance out of the pool quickly when it does fail. It's crucial for the stability of things like round robin.
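
The "2xx or 3xx means healthy" rule is easy to sketch in Python. The `probe` callable stands in for whatever actually issues the health request; it and the helper names are assumptions for illustration, and how often to run the check is left to the caller, which is the interval trade-off discussed above.

```python
def is_healthy(status_code):
    """Treat 2xx and 3xx responses as healthy, anything else as failing."""
    return 200 <= status_code < 400


def healthy_backends(backends, probe):
    """Return the back ends whose probe answers with a healthy status.

    `probe` is a hypothetical callable that takes a back end and returns
    an HTTP status code, or raises OSError if the connection fails.
    """
    healthy = []
    for backend in backends:
        try:
            if is_healthy(probe(backend)):
                healthy.append(backend)
        except OSError:
            # Connection refused, timeout, and similar: treat as unhealthy.
            pass
    return healthy
```

Running this on a timer and handing the result to the load balancer keeps dead instances out of rotation.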

Speaker 1

07:42

Let's shift gears a bit to more advanced resilience patterns. Traffic mirroring, sometimes called shadowing, sounds useful for testing.

Speaker 2

07:49

Oh, it's fantastic for canary deployments, really safe testing. The idea is you take your live production traffic, the real stuff, and you copy a small percentage of it, say ten percent, and send that copy asynchronously to a new test environment, maybe your guest v2 version.

Speaker 1

08:05

Asynchronously, so the original user isn't waiting?

Speaker 2

08:09

Exactly. And critically, Traefik ignores the response from that mirrored request. It just fires it off and forgets about it. That lets you see how your new code behaves under real load, stability, resource use, without any risk to the actual user experience.
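
A small Python sketch of fire-and-forget mirroring under simplifying assumptions: `primary` and `shadow` are hypothetical callables standing in for the two services, and sampling is done with a deterministic counter rather than whatever a real gateway uses.

```python
from concurrent.futures import ThreadPoolExecutor


class Mirror:
    """Serve every request from `primary`; copy a percentage to `shadow`."""

    def __init__(self, primary, shadow, percent):
        self._primary = primary
        self._shadow = shadow
        self._percent = percent
        self._seen = 0
        self.pool = ThreadPoolExecutor(max_workers=2)

    def handle(self, request):
        self._seen += 1
        # Mirror whenever the running total crosses another whole percent
        # step, e.g. every tenth request when percent is 10.
        before = ((self._seen - 1) * self._percent) // 100
        after = (self._seen * self._percent) // 100
        if after > before:
            self.pool.submit(self._safe_shadow, request)
        # The user only ever sees the primary's answer.
        return self._primary(request)

    def _safe_shadow(self, request):
        try:
            self._shadow(request)  # response is deliberately discarded
        except Exception:
            pass  # a failing mirror must never affect live traffic
```

Because the copy runs on a worker thread and its result is dropped, a slow or crashing test version cannot delay or break the live response.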

Speaker 1

08:23

That's clever. Okay, so we've handled load and safe testing. But what about when things actually fail? Not just one instance, but maybe a whole downstream database or API becomes slow or unresponsive. In a microservices world, that seems like it could cause chaos.

Speaker 2

08:38

It absolutely can. That's the dreaded cascading failure scenario. One slow dependency makes its callers wait...

Speaker 1

08:45

They run out of threads or connections.

Speaker 2

08:47

Exactly, and then they fail, taking down the services that call them. It ripples outwards.

Speaker 1

08:52

So how does Traefik act as a bulkhead to prevent that ripple?

Speaker 2

08:55

That's the job of the circuit breaker pattern. Traefik middleware can implement this. It watches for failures going to a particular back end service.

Speaker 1

09:03

Failures meaning errors or timeouts?

Speaker 2

09:06

Both, typically, yeah. If the failure rate or maybe latency crosses a threshold you define...

Speaker 1

09:11

Like too many errors in the last minute?

Speaker 2

09:13

Right, or responses are taking too long. If that happens, Traefik trips the breaker. It stops sending requests to that struggling service altogether for a period.

Speaker 1

09:21

And just returns an error immediately?

Speaker 2

09:23

Yep, usually a 503 Service Unavailable. It does this instantly without even trying the failing service. This protects the calling services from getting bogged down and saves resources across the system. It's like the system saying, nope, that door's closed for now, try again later.

Speaker 1

09:38

And the conditions for tripping can be quite sophisticated, I saw.

Speaker 2

09:41

Yeah. Traefik's implementation is pretty powerful. It's not just simple failure counts. You could use expressions like trip if LatencyAtQuantileMS(50.0) > 100, meaning the median response time is over one hundred milliseconds.

Speaker 1

09:54

Or based on error ratio?

Speaker 2

09:57

Exactly. ResponseCodeRatio(500, 600, 0, 600) > 0.25: trip if more than twenty-five percent of recent responses were 5xx errors. Gives you fine-grained control.
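
To make the error-ratio condition concrete, here is a simplified Python sketch of a circuit breaker. It is not Traefik's middleware: the class, its parameters, and the fixed-size window of recent responses are assumptions, and the intermediate recovering state a real breaker has is left out.

```python
import time
from collections import deque


class CircuitBreaker:
    """Trip when the share of 5xx responses in a recent window is too high."""

    def __init__(self, backend, threshold=0.25, window=20,
                 cooldown=10.0, clock=time.monotonic):
        self._backend = backend    # callable: request -> HTTP status code
        self._threshold = threshold
        self._recent = deque(maxlen=window)  # True for each 5xx response
        self._cooldown = cooldown  # seconds to stay open once tripped
        self._clock = clock
        self._opened_at = None

    def call(self, request):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._cooldown:
                return 503  # fail fast without touching the back end
            # Cooldown elapsed: close the breaker and start a fresh window.
            self._opened_at = None
            self._recent.clear()
        status = self._backend(request)
        self._recent.append(500 <= status < 600)
        window_full = len(self._recent) == self._recent.maxlen
        if window_full and sum(self._recent) / len(self._recent) > self._threshold:
            self._opened_at = self._clock()
        return status
```

While the breaker is open, callers get an immediate 503 and the struggling service gets breathing room instead of more traffic.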

Speaker 1

10:05

Okay, circuit breakers handle the big failures. What about those little annoying transient glitches, like a brief network hiccup that just needs a quick retry?

Speaker 2

10:14

Perfect use case for the retry middleware. Just like hitting refresh in your browser when a page times out, right? Traefik can be configured to automatically retry a request, maybe once or twice, if it fails with specific errors like a connection timeout or maybe a 502 Bad Gateway. It provides a basic level of self-healing for those intermittent network blips.
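
The retry idea fits in a few lines of Python. The `with_retries` helper and the `send` callable are hypothetical; this sketch retries only on connection-level errors and leaves out backoff between attempts.

```python
def with_retries(send, request, attempts=3):
    """Call send(request), retrying on connection errors.

    `send` is a hypothetical callable that raises ConnectionError on a
    transient network failure and returns a response otherwise.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return send(request)
        except ConnectionError as err:
            last_error = err  # remember the failure and try again
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

Retries like this are only safe for requests that can be repeated without side effects, which is one reason to keep the attempt count small.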

Speaker 1

10:31

Makes sense. So we've got routing, balancing, resilience. But when things do go wrong despite all this, we need to figure out why. Let's talk observability.

Speaker 2

10:41

Crucial. Observability isn't just knowing that something is wrong, but having the data to understand why. And Traefik, sitting at the entry point, is perfectly placed to collect that data.

Speaker 1

10:53

Across the three pillars, right? Logs, traces, metrics.

Speaker 2

10:57

Exactly. Let's start with logs.

Speaker 1

10:59

Now, people often say application logs alone aren't enough in microservices. What makes Traefik's logs actually useful here?

Speaker 2

11:06

Well, it generates standard error logs, of course, but the real value is often in the access logs. The trick is logging everything for every request can be really resource intensive.

Speaker 1

11:16

Yeah, generates huge amounts of data.

Speaker 2

11:18

So Traefik lets you filter them intelligently. You might say, only log requests that resulted in a redirect, status codes 300 to 302, or only log requests that took longer than, say, five seconds to complete, using a minDuration filter.

Speaker 1

11:32

Ah, so you capture the interesting or problematic events without drowning in routine data.

Speaker 2

11:37

Precisely. Optimizes performance, gets you the diagnostic data you actually need.
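
The filtering logic can be sketched as a small Python predicate. The function and parameter names are made up for the example; the sketch assumes the two filters are combined with OR, so an entry is kept if either one matches.

```python
def should_log(status, duration_s, status_range=(300, 302), min_duration_s=5.0):
    """Keep an access log entry if it is a redirect or a slow request."""
    low, high = status_range
    in_status_range = low <= status <= high
    slow_enough = duration_s >= min_duration_s
    return in_status_range or slow_enough
```

A fast 200 response is dropped, while a 301 redirect or a six-second request is kept.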

Speaker 1

11:40

Okay, logs tell us what happened at the edge, But to follow a request through multiple services, we need tracing.

Speaker 2

11:46

Right. Request tracing stitches the whole journey together. Each piece of work done by a service is a span. All the spans for one user request combine into a single trace.

Speaker 1

11:57

Like a timeline of the request's life.

Speaker 2

11:59

Exactly. And Traefik, being the first point of contact, can generate standardized trace headers, often B3 propagation headers, things like X-B3-TraceId. Think of them like a digital passport.

Speaker 1

12:10

And it passes that passport along?

Speaker 2

12:13

It injects those headers into the request before forwarding it to the first back end service. That service, if it's trace aware, adds its own span and passes the headers on. So even if the request hits five different microservices, you can...

Speaker 1

12:25

See the whole chain in a system like Zipkin or Jaeger.

Speaker 2

12:29

Exactly. End-to-end visibility. Invaluable for debugging distributed systems.
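
A Python sketch of how B3 headers might be carried from hop to hop. The `ensure_b3_headers` helper is hypothetical and ignores sampling flags; it only shows the trace ID staying constant while each hop gets a new span ID and records its parent.

```python
import secrets


def ensure_b3_headers(headers):
    """Return a copy of `headers` ready to forward to the next hop.

    The trace ID is created once at the edge and then reused. Every hop
    gets a fresh span ID, and the caller's span becomes the parent.
    """
    out = dict(headers)
    parent_span = out.get("X-B3-SpanId")
    # 128-bit trace ID as 32 hex characters, generated only if absent.
    out["X-B3-TraceId"] = out.get("X-B3-TraceId") or secrets.token_hex(16)
    # 64-bit span ID as 16 hex characters, new for this hop.
    out["X-B3-SpanId"] = secrets.token_hex(8)
    if parent_span:
        out["X-B3-ParentSpanId"] = parent_span
    return out
```

A tracing system such as Zipkin or Jaeger can then group spans by trace ID and order them using the parent links.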

Speaker 1

12:32

And the third pillar, metrics. The numbers.

Speaker 2

12:35

Yep. Traefik exposes key application-level metrics, things like total request counts, request latencies, averages, quantiles, error rates, information about the back end servers.

Speaker 1

12:43

And you feed that into...

Speaker 2

12:45

Standard monitoring systems, typically Prometheus. Prometheus scrapes these metrics from Traefik periodically. Then you can use tools like Grafana to visualize trends, plan capacity, and set up automated alerts if, say, error rates spike or latency degrades.

Speaker 1

13:01

Got it. Okay, let's bring this home to the place where Traefik seems most popular: Kubernetes. You mentioned earlier that the original Kubernetes Ingress API wasn't great.

Speaker 2

13:11

Yeah, it was, let's say, a bit underspecified, vague, which forced vendors like Traefik, NGINX, and others to rely heavily on custom annotations.

Speaker 1

13:19

Annotations being those kind of messy text strings in the YAML?

Speaker 2

13:23

Exactly, you'd have dozens of vendor specific annotations to configure basic things like timeouts or retries or sticky sessions. It wasn't clean, wasn't standardized.

Speaker 1

13:33

So how did Traefik improve on that? They gave up on Ingress in Traefik v2?

Speaker 2

13:36

They shifted strategy. They embraced Kubernetes custom resource definitions, or CRDs. They introduced their own resources like IngressRoute, Middleware, TLSOption.

Speaker 1

13:45

So instead of annotations, you define routing rules using these custom but still native-feeling Kubernetes objects.

Speaker 2

13:51

Precisely, it's a much nicer experience. As they say, configuration becomes structured, version-controllable Kubernetes YAML, just like your Deployments and Services. Any Kubernetes engineer can understand it. It follows familiar patterns, no more digging through annotation documentation for different vendors.

Speaker 1

14:08

That sounds like a huge improvement. And you also touched on TLS simplification. Getting certificates is often a real pain.

Speaker 2

14:15

Oh, historically it was awful. Manual requests, validation hoops, remembering to renew. High chance of error, high risk.

Speaker 1

14:23

So how does Traefik fix that?

Speaker 2

14:25

It integrates directly with the ACME protocol, which is the standard Let's Encrypt uses for automating certificate issuance for public domains.

Speaker 1

14:32

Let's Encrypt, the free certificate authority, right?

Speaker 2

14:35

Right. In Traefik you basically just configure a cert resolver pointing to Let's Encrypt. Then when you define an ingress route for a public host name...

Speaker 1

14:43

Traefik just handles it.

Speaker 2

14:44

It handles the entire life cycle automatically. It requests the certificate, handles the domain validation challenge, often using something called the TLS-ALPN-01 challenge, which is quite neat, retrieves the certificate, installs it, and even handles renewing it before it expires.

Speaker 1

15:01

Wow. So the developer just defines the route, asks for TLS, and Traefik and Let's Encrypt do the rest.

Speaker 2

15:08

Pretty much focus on the application logic. The complicated, error prone task of certificate management just happens.

Speaker 1

15:15

So wrapping it up, Traefik's core value seems to be replacing that old, rigid, manual configuration world...

Speaker 2

15:21

Which just breaks under microservice dynamism.

Speaker 1

15:24

With a dynamic, self-configuring system built for that reality. It's the traffic cop that learns the roads automatically as they get built or torn down.

Speaker 2

15:33

Well put. And there's a final thought, maybe a provocative one, tied to that certificate automation we just discussed. Why? Traditionally, certificate management was so painful and manual that people did it infrequently, maybe once a year. This meant certificates were valid for a long time. If one got compromised somehow, an attacker had a year-long window.

Speaker 1

15:50

Right. Long lived credentials are risky.

Speaker 2

15:52

Very, yeah. But because Traefik's integration with Let's Encrypt automates the renewal process, certificates typically only live for ninety days now, and the renewal is automatic, often no human touch needed.

Speaker 1

16:05

So it drastically shrinks the window of opportunity for an attacker using a compromised certificate.

Speaker 2

16:11

Exactly. It removes a tedious, error-prone operational task and significantly improves your security posture by enforcing short certificate lifetimes. That whole category of operational security risk just kind of melts away thanks to automation.

Speaker 1

16:25

That's a really powerful side effect of adopting modern tooling. A fantastic insight to end on. Thank you for taking us through this deep dive into Traefik.

Speaker 2

16:33

My pleasure. It's fascinating technology.

Speaker 1

16:35

And thank you, our listeners, for joining us. We'll catch you on the next deep dive.

Transcript source: Provided by creator in RSS feed: download file