We all find bugs in software from time to time, and most of the time they are fixed in relatively short order... maybe.
It is always an unsettling feeling when those bugs resurface in later versions of code after being “fixed.”
We recently received reports from a school that the wifi was “not working.” Early on in troubleshooting I could see that stations were associated to APs in the reported trouble area. All the normal quick checks looked OK: channel utilization wasn’t high, retries were low, and the APs were not overwhelmed with clients. I touched base with the person who reported the trouble, and they told me that the issue affected all devices and seemed to be confined to one specific area: the library. I checked the APs again via the web UI, and at a quick glance everything still looked OK. I logged into the controller CLI and issued the station database command to see what the clients were doing on the library access points specifically. And then it got cold...
As you can see in the graphic above, the TX packets are few or none at all. This problem occurred a couple of years ago in later versions of System Director 6 code and was fixed in a newer release, so I had seen it before. I took down all of the necessary information so that I could open a ticket with TAC, then rebooted the two APs in the library. Both affected APs were AP832s and both were experiencing the same problem. All other APs (AP1020s) were working fine. After the reboot the APs returned to working order.
I began putting my documentation together to submit a ticket when we received word that another school was having wireless issues. This school was having trouble in its library as well. As in most of our schools, the high-density areas are (for the most part) serviced by AP832s. I logged into the CLI of that school’s controller and found the same TX freeze occurring. It became obvious to me at this point that the problem was likely affecting more than these two schools. I checked a few other controllers and found most AP832s in a frozen TX state. I started to think about what the common denominators could be. All the affected APs were AP832s. All the controllers were running the same code, System Director 8.1-2-0.
And there it was….
I glanced at the uptime of the AP832s on the controllers that I had up. All of the AP832s had been up for 99 days. I started going through other controllers, checking AP832s that had been up for 99 days, and found them all to be in a TX frozen state as well. The few I found that hadn’t reached 99 days were still operating normally. Clearly there was a bug that reared its head on the 99th day of uptime.
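The check I was doing by eye can be sketched in a few lines of Python. This is only an illustration of the reasoning, not a tool that talks to a controller: the `AccessPoint` record, the `at_risk` helper, and the sample inventory are all hypothetical, and you would have to fill the list from whatever your own controller reports.

```python
from dataclasses import dataclass

# Uptime (in days) at which the TX freeze showed up in this post.
FREEZE_DAY = 99


@dataclass
class AccessPoint:
    # Hypothetical record; not a real controller output format.
    name: str
    model: str
    uptime_days: int


def at_risk(aps, model="AP832", threshold=FREEZE_DAY):
    """Return APs of the given model whose uptime has reached the threshold."""
    return [ap for ap in aps if ap.model == model and ap.uptime_days >= threshold]


# Made-up sample inventory, for illustration only.
inventory = [
    AccessPoint("library-1", "AP832", 99),
    AccessPoint("library-2", "AP832", 99),
    AccessPoint("hall-1", "AP1020", 99),
    AccessPoint("gym-1", "AP832", 42),
]

flagged = [ap.name for ap in at_risk(inventory)]
print(flagged)
```

Filtering on both model and uptime mirrors the two common denominators from the troubleshooting: only AP832s were affected, and only once they hit day 99.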
So you are probably asking yourself why so many of the APs had reached the 99 day uptime mark at the same time. I was wondering the same thing at first and then it hit me. I had done a mass upgrade of controllers over the summer to System Director 8.1-2-0 which mostly put all of the controller/AP uptimes in sync.
We launched a preemptive strike and rebooted all AP832s except for a group of three APs at one middle school which hadn’t reached 99 days yet. We knew that if we rebooted all of the APs the issue would be resolved for another 98 days or until FortiRu provided us a fix. All of the AP832s returned to working order.
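Since the reboot only buys time, it helps to know when the clock runs out again. Here is a minimal sketch of that date arithmetic, assuming the freeze lands on day 99 of uptime as we observed. The function names, the one-day safety margin, and the example date are my own placeholders, not anything from the vendor.

```python
from datetime import date, timedelta

# Uptime (in days) at which the TX freeze showed up in this post.
FREEZE_DAY = 99


def freeze_date(reboot_day, freeze_after_days=FREEZE_DAY):
    """Date on which an AP rebooted on reboot_day reaches the freeze uptime."""
    return reboot_day + timedelta(days=freeze_after_days)


def latest_safe_reboot(reboot_day, margin_days=1):
    """Last date to reboot again, leaving margin_days before the freeze."""
    return freeze_date(reboot_day) - timedelta(days=margin_days)


# Arbitrary example date, not one of the actual reboot dates.
example_reboot = date(2024, 1, 1)
print(freeze_date(example_reboot))
print(latest_safe_reboot(example_reboot))
```

In practice you would want a bigger margin than one day, and ideally a maintenance window that doesn’t fall in the middle of a school day.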
At that point I assembled all of my documentation and submitted a ticket to TAC. We did some initial troubleshooting, but they needed some APs in the frozen state to get the information they were after. We put the ticket on hold until our test group of three APs reached 99 days of uptime. That occurred over this past weekend, November 20. When I returned to the office I checked the APs, which were at an uptime of 102 days. All three were in a TX frozen state.
As of right now, the issue is with TAC. It looks like we are the first to report it. They have all the information that they requested (logs, diags, etc.), so I should be hearing back from them soon. In the meantime, a quick reboot is just the ticket to thaw out the TX.
Here are a few commands I used to gather AP/radio data from within the AP CLI:
stadb display assigned
sys exec /wl -i radio1 msglevel err
radio txqinfo radio0 (radio zero)
radio txqinfo radio1
radio txq radio0
radio show radio0 (radio zero) – radio specific parameters
radio show radio1 – radio specific parameters
radio stats radio0 (radio zero) – radio specific stats
radio stats radio1 – radio specific stats
dev cmd radio0 reset – resets radio without rebooting the AP
dev cmd radio1 reset – resets radio without rebooting the AP
sys exec cat /proc/meminfo