Designing your network for fail-over

The do’s and don’ts

An essential part of good network design is to plan for failures. In a RADIUS ecosystem, one major requirement is that clients can always connect to a RADIUS server, and that the system can continue to operate even after the loss of one or more RADIUS servers. This article walks through some good and bad fail-over network designs.

One of our key design principles for network configuration is to always design for the worst-case scenario. It is a fact of life that at some point, a network component will fail. There are a variety of reasons for failure, and each of those reasons requires analysis. Designing for these failures is hard.

That is, it is easy to design a system which works when everything is fine. It is much harder to design a system which can gracefully adapt to failures. Designing for failure is just as important as designing for optimal performance, but it is a frequently overlooked aspect of network design.

In our work with clients, we are often asked to review network designs which work just fine as long as nothing ever goes wrong and nothing ever changes. However, they are extremely fragile when problems occur (and they will occur). In this article, we look at network architectures which are intended to deal gracefully with the failure of a RADIUS server. Some designs work well, others less so.

Randomly assigned RADIUS servers

One solution for RADIUS server failure is to randomly assign clients to one of several RADIUS servers for authentication. This design does provide multiple servers for redundancy, but it is suboptimal. If one of the servers goes down, traffic is routed only to the available ones. It is also hard to know which of the clients were associated with the offline server and need to be reconnected. When something goes wrong, administrators have a random collection of clients which need to be fixed or checked.

This design adds some resiliency, but it greatly increases administrator load when the network has an issue, which is precisely the worst time to give administrators more work!
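
To make the problem concrete, here is a minimal Python sketch of random assignment. The server and client names are hypothetical, and the fixed seed is there only to make the run repeatable. It illustrates that the clients tied to a failed server form an arbitrary subset, which can only be found by consulting the assignment records.

```python
import random

# Hypothetical names, used only for illustration.
SERVERS = ["radius1", "radius2", "radius3"]
CLIENTS = [f"nas-{i:02d}" for i in range(12)]


def assign_randomly(clients, servers, seed=None):
    """Give each client one randomly chosen RADIUS server."""
    rng = random.Random(seed)
    return {client: rng.choice(servers) for client in clients}


def affected_clients(assignment, failed_server):
    """List the clients that were pointed at the failed server."""
    return sorted(c for c, s in assignment.items() if s == failed_server)


assignment = assign_randomly(CLIENTS, SERVERS, seed=1)
# The result is a scattered subset with no pattern an administrator could guess.
print(affected_clients(assignment, "radius2"))
```

Nothing about a client's name or location predicts which server it uses, so every outage starts with a lookup.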

Geographically assigned RADIUS servers

An improvement on the above design is to separate the RADIUS servers by geographical area. In the US, for example, it is common to have separate servers assigned to east, central, and west regions.

Depending on its IP address, each client is configured to communicate with the RADIUS server (or servers) responsible for that specific geographical area. The client configuration can also specify a secondary server to use if the primary is not available.

Geographically assigning RADIUS servers to clients has a few advantages over the randomly assigned solution mentioned previously.

If one of the RADIUS servers goes down, it is clear which clients have been affected and need to be reconnected.
Multiple RADIUS servers for each geographical area provide a certain degree of redundancy.
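
As a rough illustration of geographic assignment, the Python sketch below maps client IP addresses to regions and then to a primary and secondary server. The subnets, server addresses, and function names are all assumptions made up for this example; a real deployment would use its own address plan.

```python
import ipaddress

# Hypothetical address plan: client subnets and the servers for each region.
REGION_SUBNETS = {
    "east": ipaddress.ip_network("10.1.0.0/16"),
    "central": ipaddress.ip_network("10.2.0.0/16"),
    "west": ipaddress.ip_network("10.3.0.0/16"),
}
REGION_SERVERS = {
    "east": ["192.0.2.10", "192.0.2.11"],
    "central": ["192.0.2.20", "192.0.2.21"],
    "west": ["192.0.2.30", "192.0.2.31"],
}


def region_for(client_ip):
    """Find the region whose subnet contains the client address."""
    addr = ipaddress.ip_address(client_ip)
    for region, subnet in REGION_SUBNETS.items():
        if addr in subnet:
            return region
    raise LookupError(f"no region configured for {client_ip}")


def servers_for(client_ip):
    """Return [primary, secondary] for the client's region."""
    return REGION_SERVERS[region_for(client_ip)]


def affected_clients(client_ips, failed_region):
    """Clients to check when a region's servers have a problem."""
    return [ip for ip in client_ips if region_for(ip) == failed_region]


clients = ["10.1.4.20", "10.2.9.1", "10.3.0.7", "10.1.200.3"]
print(servers_for("10.2.9.1"))
print(affected_clients(clients, "east"))
```

Because the mapping follows the address plan, the list of affected clients can be derived from the region alone.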

However, specifying the fail-over options in the client has a significant drawback.

The unfortunate truth is that fail-over implementations vary widely by client, and these implementations can often be unreliable. For example, some clients check server availability very infrequently (if at all), resulting in some clients never getting switched back to their original server once the problem has been resolved. Some other clients check for server availability too frequently, leading to clients getting moved from one server to another, impacting performance.
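
The two failure modes can be shown with a toy model. The Python sketch below is not taken from any real client. The class, its parameters, and the `primary_is_up` flag (which stands in for the result of a real health probe) are hypothetical.

```python
class FailoverClient:
    """Toy model of client-side fail-over between two RADIUS servers."""

    def __init__(self, primary, secondary, recheck_interval):
        self.primary = primary
        self.secondary = secondary
        # Seconds to wait before probing the primary again.
        # None models a client that never re-checks.
        self.recheck_interval = recheck_interval
        self.active = primary
        self.last_check = None

    def report_timeout(self, now):
        """The active server did not answer; fall back to the secondary."""
        if self.active == self.primary:
            self.active = self.secondary
            self.last_check = now

    def choose_server(self, now, primary_is_up):
        """Pick a server for the next request."""
        on_secondary = self.active == self.secondary
        if on_secondary and self.recheck_interval is not None:
            if now - self.last_check >= self.recheck_interval:
                self.last_check = now
                if primary_is_up:
                    self.active = self.primary
        return self.active


# A client that never re-checks stays on the secondary a day later.
stuck = FailoverClient("primary", "secondary", recheck_interval=None)
stuck.report_timeout(now=0)
print(stuck.choose_server(now=86400, primary_is_up=True))

# A client with a 30 second hold-down returns once the interval has passed.
sane = FailoverClient("primary", "secondary", recheck_interval=30)
sane.report_timeout(now=0)
print(sane.choose_server(now=10, primary_is_up=True))
print(sane.choose_server(now=30, primary_is_up=True))
```

Setting `recheck_interval` to zero models the opposite problem: with an intermittently reachable primary, the client would switch servers on almost every request.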

Geographically assigned RADIUS groups

An improvement to the geographically assigned RADIUS servers is to create a local management network or subnet for each geographic area. All of the geographical areas can use the same subnet and list of server IP addresses, as each area does not need to communicate with another area. The servers can share an IP address, using a system like High Availability (HA), or a load balancer.

Each area then can contain multiple RADIUS servers, but these details are hidden from clients. From the client point of view, there is only a single IP address for each geographic area.

Creating local networks for each area has several advantages:

Simplified client configuration and maintenance. RADIUS servers can be added to and removed from each area without requiring any changes to the region-specific clients.

Improved stability of the overall ecosystem. Fail-over implementation in clients is notoriously unreliable. Putting the fail-over logic within the network itself makes the behavior more stable and predictable.

Greater control over fail-over behavior. It is simply a better choice to put network logic within the network itself, rather than in the client. By removing any dependencies on the client implementation, the system administrator can have much more fine-tuned control over what happens in the event of failures.
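
As a sketch of what "fail-over logic within the network" can mean, the Python below models the selection step a load balancer might perform behind an area's shared IP address. It is an illustration under stated assumptions, not a description of any particular product. The class name, the server names, and the idea of hashing on a per-client key (for example a MAC address) are all hypothetical choices for this example.

```python
import hashlib


class AreaPool:
    """Toy model of the selection logic behind one area's shared IP address."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)

    def mark_down(self, server):
        """Record a failed health check."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Record a recovered server."""
        if server in self.servers:
            self.healthy.add(server)

    def pick(self, session_key):
        """Choose a healthy server, keeping the same key on the same server."""
        candidates = [s for s in self.servers if s in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy RADIUS servers in this area")
        digest = hashlib.sha256(session_key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(candidates)
        return candidates[index]


pool = AreaPool(["radius-a", "radius-b", "radius-c"])
key = "00-11-22-33-44-55"
first = pool.pick(key)
pool.mark_down(first)       # the server fails a health check
second = pool.pick(key)     # traffic moves without any client change
print(first, second)
```

The health state and the selection rule live in one place that the administrator controls, so changing fail-over behavior never involves touching a client.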

However, creating networks to serve specific geographic areas and configuring clients to point to the appropriate area still has one significant drawback: it requires the clients to maintain some logic about the network. Ideally, we want our clients to be as simple as possible, and put all the network logic within the network. We can improve on this design further.

Single RADIUS IP address

Our last solution completely abstracts the geographical designations of the RADIUS servers and presents a single IP address to all of the clients. This practice is called “anycast”, and is done by advertising the same IP address from multiple locations using a routing protocol such as BGP or OSPF.

Putting all the RADIUS servers behind one IP address means that all the clients can be configured in exactly the same way. Unlike with the geographically assigned RADIUS groups, the clients don’t even need to know which region they are in to direct traffic correctly. In this scenario, the clients are simple in the sense that they contain no network information at all, aside from the one RADIUS IP address.

The single RADIUS IP address solution has the following advantages:

Standardized client configuration. Unlike any of the other solutions discussed above, this configuration allows all clients to be configured identically. When all clients have the same standard configuration, deploying and maintaining them becomes almost trivial.

Scaling the backend ecosystem doesn’t require any changes to the clients. The backend can be arbitrarily complex behind the IP address used by all the clients. For example, additional RADIUS servers can be added, a load balancer can be introduced to distribute traffic to a sharded configuration, or RADIUS servers can be split between live and historical data. These changes can be made behind the scenes without making any changes at all to the clients.

Improved fail-over capability. Because the routing logic is in the network and not in the clients, the overall system can be extremely resilient to even catastrophic events. For example, if there is a major power outage in the Eastern region, the traffic can simply be rerouted to the Western RADIUS servers without the clients ever noticing a difference. Or in extreme cases where multiple areas go down, new RADIUS servers can be spun up in minutes and all the traffic redirected to the new instances.
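
The rerouting behavior can be approximated in a few lines of Python. This is a simplified model rather than a routing protocol implementation: the cost table and the function name are hypothetical, and real route selection in BGP or OSPF involves far more than a single cost number.

```python
# Hypothetical routing costs from each client region to each server site.
ROUTE_COST = {
    "east": {"east": 10, "central": 20, "west": 30},
    "central": {"east": 20, "central": 10, "west": 20},
    "west": {"east": 30, "central": 20, "west": 10},
}


def site_for(client_region, advertising_sites):
    """Return the lowest-cost site still advertising the shared address."""
    reachable = {
        site: cost
        for site, cost in ROUTE_COST[client_region].items()
        if site in advertising_sites
    }
    if not reachable:
        raise RuntimeError("no site is advertising the RADIUS address")
    return min(reachable, key=reachable.get)


all_sites = {"east", "central", "west"}
print(site_for("east", all_sites))             # normal operation
print(site_for("east", all_sites - {"east"}))  # the east site has withdrawn its route
```

The client configuration is identical in both calls. Only the set of sites advertising the address changes, and that change happens entirely inside the network.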

Design principles for resilient network design

Our discussion of different fail-over design configurations draws on some of our core network design principles.

Always design for the worst-case scenario. Bad things will happen; be prepared.
Keep the clients as simple as possible. Keep network logic in the place most capable of handling it: the network itself. You can control the network; you cannot control the client fail-over implementation.
Look for ways to centralize configuration. If many different parts of your system require custom configuration, you’re probably making it more complex than it needs to be.
Designing for failure not only makes your system more resilient, it also usually results in simplified overall configuration, maintenance, and ability to scale.

Need more help?

Network RADIUS has been helping clients around the world design and deploy their RADIUS infrastructure for 20 years. We specialize in complex systems and have seen pretty much every variation and problem out there. If you want help from the people who wrote FreeRADIUS, visit our quote page to contact us for a consultation.

Published 2023-05-25 12:00:00 +0000