What does "unresponsive child" error message mean?

Most likely, your database is slow

Like any system, FreeRADIUS logs error messages to inform administrators of problems within the FreeRADIUS server itself. Sometimes, however, a message logged by FreeRADIUS actually indicates that something is wrong with one of the connected systems. A common cause of some of these errors is an unexpectedly slow database.

Why am I getting an unresponsive child error message?

This message is essentially a timeout error. It means that a request was sent to the database, and the database hasn’t responded within a set time limit. Most database queries should happen instantaneously from the perspective of a human being, which means much less than one second. In contrast, the unresponsive child message is produced by FreeRADIUS after about 30 seconds! That is, FreeRADIUS is letting you know that it hasn’t heard back from the database in a very long time. This long time to respond indicates a major problem with the database.

One common question is “what is going wrong with FreeRADIUS when it produces this message?” The answer, of course, is nothing. FreeRADIUS depends on other systems to do its work. If those systems go down or are slow, then FreeRADIUS will log a warning or error message indicating what is going wrong. The solution is not to fix the system which logs the error, but instead fix the cause of the error.

The timeout here is determined by the max_request_time configuration item, which is set in the main “radiusd.conf” file. We do not recommend increasing the default value. Changing the value will make FreeRADIUS wait longer, but it will not make the database faster! The error message is describing a symptom: a slow response from the database. The cure is to fix the underlying problem.

Ignoring duplicate packet, and Conflicting packet messages

These messages are usually related to the unresponsive child error. They are all indications of a slow response time from the database.

After the NAS (Network Access Server) has been waiting a set time for a response from the RADIUS server, it will usually try the same request again a few times. This results in an ignoring duplicate packet error message. The RADIUS server is basically saying “I recognize this as a duplicate request. I’m still working on the original request, I don’t have a reply yet, so I can only ignore this repeated request”.

Eventually, the NAS will give up on that initial request and reuse the same packet ID for a completely different request. Packet IDs are only 8 bits, so they get recycled frequently. This results in a conflicting packet error message. The RADIUS server recognizes the packet ID as a request it still hasn’t dealt with, and sees that the payload is different from what it currently associates with that ID.
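The difference between a duplicate and a conflicting packet can be sketched in a few lines of Python. This is only an illustration of the idea described above, not the FreeRADIUS implementation: the in_progress table and the classify function are hypothetical names, and comparing a hash of the whole payload is an assumption made to keep the example short.

```python
import hashlib

# Hypothetical table of requests that are still being processed, keyed by
# (client address, client port, 8-bit packet ID). Not FreeRADIUS internals.
in_progress = {}


def classify(src_ip, src_port, packet_id, payload):
    """Return 'new', 'duplicate', or 'conflicting' for an incoming request."""
    if not 0 <= packet_id <= 255:
        raise ValueError("RADIUS packet IDs are 8 bits (0-255)")

    key = (src_ip, src_port, packet_id)
    digest = hashlib.sha256(payload).digest()
    previous = in_progress.get(key)

    if previous is None:
        # Nothing outstanding for this ID: start working on the request.
        in_progress[key] = digest
        return "new"
    if previous == digest:
        # Same ID, same contents: the NAS is retransmitting.
        return "duplicate"
    # Same ID, different contents: the NAS gave up and reused the ID.
    in_progress[key] = digest
    return "conflicting"


print(classify("192.0.2.10", 40000, 7, b"request-A"))  # first attempt
print(classify("192.0.2.10", 40000, 7, b"request-A"))  # retransmission
print(classify("192.0.2.10", 40000, 7, b"request-B"))  # ID reused
```

In this sketch an entry only leaves the table when a reply is sent, which is why a slow database keeps entries around long enough for the NAS to retransmit or reuse the ID.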

Typically, you will see several ignoring duplicate packet messages in your log file first, then perhaps a conflicting packet message, and finally an unresponsive child error. This sequence reflects that the NAS tried the same request a few times (duplicate packets) and then eventually gave up and reused the ID for a different request (conflicting packet). Somewhere in this process, the RADIUS server reached the time limit for processing a request, and logged an unresponsive child error.

“But my database is fast! It must be something else.”

Sorry, but no. If you are seeing an unresponsive child message in your logs, it is almost always because the database took too long to respond to a request. If you are not using a database, the unresponsive child message can be caused by a slow script or external system. In some cases, the logging system is slow, and can only handle a few thousand messages per second!

FreeRADIUS has been benchmarked at many tens of thousands of packets a second, so it can handle any reasonable load. If the load on the system is less than 100K packets per second, then the cause of the error is almost certainly not FreeRADIUS.

The unresponsive child message is somewhat akin to pinging an IP address and getting a timeout response. It does not mean that there is a problem with the ping utility. It means there is a problem with reaching that IP address.

It is not uncommon for a system to operate without any performance issues for several months or even years before this error appears. If this is the case, it means that the database, or one of the other systems which use it, has recently changed.

What to do about an unresponsive child error

Like any kind of system debugging, there isn’t a one-size-fits-all solution. However, there are some steps you can take to try to narrow down the source of the problem.

1) Check your database slow query log.

Most databases have a feature to record SQL queries that take a long time to execute. Sometimes these log files are disabled by default, so you will need to turn them on. Each database is slightly different, but generally you can define a long_query_time parameter. Here are links to more information about slow query logs for MySQL, LDAP, Microsoft Azure, and PostgreSQL.

2) Run some typical SQL queries and measure how long it takes to respond.

If you’re not sure how to measure the response time we have a simple rule of thumb: If the time between the query and response is humanly detectable, it is too slow. Because every environment is unique, we’re not able to recommend how to construct a “typical” SQL query. Only you will know what is typical for your unique situation.
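One way to apply this rule of thumb is to time a query from a small script. The sketch below uses Python with an in-memory SQLite table purely as a stand-in so that it runs anywhere. The users table, its columns, and the time_query helper are hypothetical. In practice you would connect to your own database with its own driver and run a query that is typical for your site.

```python
import sqlite3
import time


def time_query(conn, sql, params=()):
    """Run one query and return (rows, elapsed time in seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    return rows, time.perf_counter() - start


# Stand-in data only: this is not a real RADIUS schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT PRIMARY KEY, secret TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("alice", "example"))

rows, elapsed = time_query(
    conn, "SELECT secret FROM users WHERE name = ?", ("alice",)
)
print(f"{len(rows)} row(s) in {elapsed * 1000:.3f} ms")
```

Running the same measurement several times, and at busy times of day, gives a better picture than a single sample.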

3) Make sure the disk for your database isn’t full.

This is one of the most common issues we see when users suddenly get an unresponsive child message after running FreeRADIUS for some time. The amount of storage used tends to expand, and can eventually reach the available space limit if not actively monitored.

You can check the status of your disk space by running the df command line utility, or review the log files of your database for warning messages about your disk approaching capacity.
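If you would rather check disk space from a script, for example as part of regular monitoring, a minimal Python sketch looks like this. The DATA_DIR path and the 90 percent WARN_AT threshold are assumptions chosen for illustration. Point the path at the filesystem that actually holds your database files. The percentage may also differ slightly from what df reports, because df accounts for reserved blocks differently.

```python
import shutil


def disk_percent_used(path):
    """Return how full the filesystem containing `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


DATA_DIR = "/"   # hypothetical: replace with your database's data directory
WARN_AT = 90.0   # hypothetical threshold, in percent

percent = disk_percent_used(DATA_DIR)
if percent >= WARN_AT:
    print(f"WARNING: {DATA_DIR} is {percent:.1f}% full")
else:
    print(f"{DATA_DIR} is {percent:.1f}% full")
```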

If you find that your disk is full, or getting close to it, you will need to free up some space. Here are your main options:

Remove unnecessary files
Purge binary log files
Drop old tables, or rebuild a very big table

If you are using MySQL, this article has a good step-by-step breakdown.

Bear in mind that the long-term solution is most likely to upgrade your hardware to accommodate your storage needs.

4) Make sure the CPU for your database isn’t maxed out.

Depending on the database you use, there are a variety of utilities available to check CPU usage. If you discover that the database CPU usage is indeed much higher than usual there may be some infrastructure design issues that are contributing to this. Some examples we have seen include:

The RADIUS server and database are on the same VM as several other resource intensive applications. If any of those other applications start to use more resources than usual, everything else on the VM suffers.
There are other services or applications that are overloading the database. For example, an accounting system querying the database in order to build monthly reports can have a major impact on the response times of the database to the RADIUS server. To avoid this type of issue, you should ensure that the RADIUS databases are not used by any other systems. We will be discussing this issue in a future article.
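As a rough first check on a Unix-like database host, you can compare the system load average with the number of CPUs. The Python sketch below is only an approximation. Load average is not the same thing as CPU usage, os.getloadavg is not available on Windows, and the load_per_cpu helper and the 1.0 threshold are assumptions made for illustration.

```python
import os


def load_per_cpu():
    """Return the 1-minute load average divided by the number of CPUs."""
    one_minute, _five, _fifteen = os.getloadavg()
    cpus = os.cpu_count() or 1  # cpu_count() can return None
    return one_minute / cpus


ratio = load_per_cpu()
if ratio >= 1.0:
    print(f"Load per CPU is {ratio:.2f}: the host looks saturated")
else:
    print(f"Load per CPU is {ratio:.2f}")
```

A value that stays near or above 1.0 suggests the host has more runnable work than CPUs, which is a good reason to look at what else is running on the same machine.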

Fundamentally, it is important to bear in mind that your RADIUS server and identity store database are part of your critical infrastructure and need to be resourced accordingly. When assigning disk space and CPU capacity in your network design, you must ensure that the RADIUS server and database have enough free capacity to manage sudden spikes in traffic or CPU usage. Rather than allocating resources according to “typical” scenarios, you should plan them for the more extreme cases. While it might seem like overkill when the extra CPU and disk space sits unused most of the time, it will save your network from collapsing when something unexpected happens. Which it always does.

Still need help?

In our experience with clients and in troubleshooting conversations on the FreeRADIUS mailing list, an unresponsive child error is a symptom of a fundamental underlying issue with the larger infrastructure design. If you have a complex network environment and need expertise from the people who wrote FreeRADIUS, contact us for a consultation.

Published 2021-02-10 12:00:00 +0000
Categories: articles