8 Most common RADIUS mistakes

And how to avoid them

We see a lot of questions on the FreeRADIUS mailing list and from our clients that boil down to “RADIUS isn’t working. Can you tell me why?”. Most of the time, the problem isn’t with RADIUS itself, it is with the supporting infrastructure, or with less than ideal network administration practices. Many of these issues aren’t necessarily self-evident so we wanted to outline the most common mistakes and tell you how to avoid them.

The good news is that most of these practices don’t cost much in terms of money or time to do correctly. The other good news is that they can make the difference between catastrophic failure and a minor inconvenience. The bad news is that most people don’t follow these practices.

1) Not properly resourcing your RADIUS infrastructure

We’ve said it once, we’ll say it again:

RADIUS is a key component of your critical infrastructure. You should treat it that way.

We often see RADIUS on hardware (or VM) that is also running many other things. Put bluntly, this is a bad idea. Putting other components on your RADIUS box increases the risk of library collisions and incompatibilities, and can starve RADIUS of system resources.

We recommend running RADIUS in a VM, by itself where it uses 5-10% CPU capacity in everyday operations.

Some system administrators consider this wasteful, because there is so much idle CPU time. That’s when they start installing other components onto their RADIUS system.

When everything is running well, loading up your RADIUS machine with other components will not cause any problems. But if there is one thing we know for sure, it’s that everything won’t run well all the time. It is good system administration practice to plan for the worst case scenario. If your network goes down for any reason, (and it will), you will desperately need that extra CPU capacity when all your users try to log back onto your network. If you don’t have those reserve CPU resources available, you can end up with a cascading series of failures that becomes difficult to recover from as users overwhelm your system.

2) Not putting your RADIUS server in a VM

The primary benefit of running RADIUS inside a Virtual Machine (VM) is that it makes recovering from problems almost trivial. If you take regular snapshots of your VM (which you should), the cost of recovering from a problem is as simple as reverting back to a previous snapshot.

As we’ve already mentioned, it is a certainty that your systems will encounter some kind of problem at some point. For example: after a system update, or a library update, or if a file becomes corrupted. When RADIUS is not run in a VM, it can take hours or even days to find the problem and fix it. If you do run RADIUS in a VM however, it will take minutes to revert to your previous snapshot.

3) Enabling automatic updates

Updates are not in themselves a bad thing, you just shouldn’t do them automatically. If you run your RADIUS server in a VM (which you should), you should take a snapshot of your VM before installing the update. If the only thing running on your VM is your RADIUS server (see point #1) taking a snapshot shouldn’t take more than a couple of minutes. After you’ve taken the snapshot and installed the update, you can recover from any unexpected issues by quickly reverting back to your previous image.

4) Not rebooting after an OS update

Rebooting after an OS update is simply good system administration hygiene, but we’ve seen this issue often enough that we realize it might not be common knowledge.

When you update your operating system, the update isn’t actually “complete” until you have restarted your system.

We have encountered situations where an administrator performed a system update, and didn’t reboot until weeks or months later. If the update caused any problems, it can be extremely difficult to track down when you’ve forgotten about the update. We realize that nobody likes to reboot their system, and it never seems like there is a good time to do it. But you should do it anyways.

5) Not putting your RADIUS logs “off-box”

In keeping with our philosophy of designing for the worst case scenario, we recommend that all log files should be stored in different hardware than your RADIUS server. Keeping logs and server separate makes it much easier to recover from errors. If something catastrophic happens to your RADIUS server, you will still have your logs to help you diagnose the problem and recover. By distributing the potential points of failure, your overall system becomes more resilient to problems.

6) Not putting your database on separate hardware

We commonly see RADIUS configurations where the server and the database are running on the same hardware. It can seem like a reasonable approach because their functions are so intertwined. However, in practice it is usually a bad idea.

One of the reasons the RADIUS server and databases are often run on the same hardware is because most of the time, this works without any issues, and the potential problems usually manifest later on in the life of the system. Sometimes much later. This can lull administrators into a false sense of security, and make it difficult for them to diagnose the root cause of the problems.

There are important benefits to keeping your database and RADIUS server separate.

  • Database queries can be resource intensive and take up valuable CPU resources that the RADIUS server needs. Using separate hardware ensures that complex accounting queries are less likely to affect the RADIUS server.

  • Creating VM snapshots and reverting to previous ones is much simpler when that VM only contains the RADIUS server, and not your entire database.

  • The risk of clashes between libraries and dependencies is greatly reduced.

7) Not regularly exporting database backups

Optimism is not the most helpful system administrator trait.

It is extremely likely that at some point, your database will encounter a problem. Perhaps because of a new library update, or because someone accidentally typed the wrong thing and deleted all your records. When this happens (and it will), a backup of your database will make recovery trivial. If you don’t have a backup, it can be almost impossible to recover from a catastrophic failure.

8) Running admin queries against your production database

We often see clients running into problems when they run complex billing queries against the “live” production database. These queries can be very long lived and resource intensive, which make the database stop responding to FreeRADIUS. This outage can result in “unresponsive child” error messages in the FreeRADIUS logs.

If your system design includes a secondary database which is synchronized with the primary one, all complex queries should be run against the secondary database. If you only have a single primary database, we recommend exporting the data to a backup system and running the complex queries on the backup.

Need more help?

Network RADIUS has been helping clients around the world design and deploy their RADIUS infrastructure for 20 years. We specialize in complex systems and have seen pretty much every variation and problem out there. If you want help from the people who wrote FreeRADIUS, contact us for a consultation.

Read more...

Related articles