#WLPC 2017 PHX, #SingleChannelAdventures, and Looking Back

It has been a couple of weeks since I returned from WLPC in Phoenix. It was a great trip down to the southwest which included catching up with some good friends, listening to great presentations, learning a lot, and also presenting on a topic near and dear to my heart. I was able to put a lot more names to faces and have some great conversations.

As many of you know I presented on how well single channel architecture works for us.  You can find the video below:

I have received almost entirely positive feedback on the presentation. It was a great opportunity for me to speak in front of a large crowd and give examples of how we have Meru deployed across our school district. As well as it went, I feel like I need to explain a few things. I have come to the realization that a Ten Talk may have been the wrong platform to discuss the topic at hand. I understand that most people wanted more information about SCA rather than “it just works for us.” Ten minutes was simply not enough time to discuss all of this.

First, I understand that being a large network doesn’t equal wired/wireless networking done right. Even though this wasn’t conveyed in the presentation, I have received feedback suggesting it came across that way. Although we are very proud of how large and successful our network is, it doesn’t mean that large = successful. A lot of time and effort goes into making a wireless network work well for 60,000 users on average every day. We all know that architecture doesn’t matter if you don’t define, design, implement, and validate.

Second, I gave some confusing information. I realized shortly after the presentation that the stat showing that we “average ~12 connections per AP county wide” was very vague. Some of our access points have upwards of 100 clients on them at once for an extended period of time. Some of our access points have 10 clients total (maybe fewer) in an entire day. It doesn’t matter whether there are 30+ clients on an AP during every class or another AP sits unused for the majority of the day. If an AP was placed in a location, it was placed there with the intent that wifi could be needed at any point of an instructional day, planned or unplanned. The opinion that there are too many APs or not is irrelevant. Unless you have actually visited our schools and used the network, you won’t know. The proof is in the pudding, so to speak.
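To show why a single average hides so much, here is a minimal sketch using made-up per-AP client counts (the numbers are hypothetical and not taken from any controller). The mean comes out at 12 even though one AP carries 100 clients and most carry almost none.

```python
from statistics import mean, median

# Hypothetical peak client counts for ten APs; illustrative only.
clients_per_ap = [100, 2, 0, 3, 1, 5, 0, 4, 2, 3]

# The mean alone suggests a lightly loaded network...
print(f"mean:   {mean(clients_per_ap):.1f}")
# ...while the median and max show how uneven the load really is.
print(f"median: {median(clients_per_ap):.1f}")
print(f"max:    {max(clients_per_ap)}")
```

Reporting the median and the maximum alongside the mean gives a much fairer picture of how an AP population is actually being used.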

I know that most of us consider our wireless networks as mission critical.  The vast majority of our schools don’t have desktop computers other than in administrative areas.  Many of our schools don’t have computer labs, but employ several carts of mobile devices.  I know that most of us want NUMBERS to back up the user experience, but most of the time I don’t have time to get this type of information.  If reports of “wireless problems” are low or non-existent and teachers are able to complete their instruction using mobile devices and be successful, our mission is accomplished.  I know a lot of you won’t be happy until you see NUMBERS and that is fine.  I am going to try real hard to get some of those and put them out there.  To be perfectly honest, I don’t know if some of you will believe it then, and that is fine too.

I understand many of the things that have been said as far as physics, airtime consumption, high density, etc.  I don’t necessarily disagree with some of the opinions.  You may be absolutely right!  The thing that is somewhat discouraging is that there are a lot of opinions based on no experience or an experience that was several years ago.  Don’t get me wrong, the first time I saw virtual port, even with my limited wireless knowledge at the time, I couldn’t believe it would work.  A lot has been improved upon since then with virtual cell and especially the equipment.  I’m not saying Meru will beat another vendor head to head every time, but I bet it will sometimes!  Does it really matter how many jigabits we can ram through the air or is it more important for a user to have an experience where they don’t even notice the wifi?  A user sits down, opens their laptop, completes a task, closes the laptop and moves on without even acknowledging the presence of wifi.  The wifi just works.  Please understand, I am not trying to discount numbers such as channel utilization, retries, available bandwidth, etc.  Those are all very important things to consider in a wireless environment.  We all look to those numbers first when trouble is initially reported.  Those are the numbers that give us a baseline to begin troubleshooting.  They are absolutely critical to a successful wireless deployment.

Obviously interest has been piqued since the presentation. It has been fun, most of the time, discussing various aspects of single channel vs multi channel environments. I have heard a handful of different people give a handful of different explanations of what Meru’s special sauce is, and they are all different. There isn’t a ton of information out there concerning the “magic” of Meru, but if you are truly interested please watch any video by Dr. Bharghavan, founder of Meru Networks. Many of these videos are a few years old, but still offer great information. To be honest, I need to re-watch most of them as I sometimes get tangled in the “it just works” mindset. Below are a handful of videos.

If you want to catch the videos later but still want to read the conclusion, please scroll down.

Meru Networks Wireless Virtualization Architecture – Part 1

Meru Networks Wireless Virtualization Architecture – Part 2

Meru Networks Wireless Virtualization Architecture – Part 3

Contention Management Schemes: Part 1 – Single / Multiple AP

Contention Management Schemes: Part 2 – Multiple APs

Maximizing Air Traffic – Part 1: Maximize Channel Reuse

Maximizing Air Traffic – Part 2: Simultaneous Transmissions

Leveraging Single Channel Architecture for Multiple Channels

 

A Little Dated But Still Good

Very High Density Wireless LAN Demonstration for BYOD: #1

Very High Density Wireless LAN Demonstration for BYOD: #2

Conclusion

For those of us who went to WLPC 2017, we heard more than one person mention that wireless networking can be done in more than one way. We also heard that less than ideal practices may be employed against our better wireless judgement due to other factors such as politics, aesthetics, etc. Sometimes I think we need to remember that just because someone does something differently doesn’t mean that it is wrong. We also need to remember that just because we don’t like a technology doesn’t mean it doesn’t fit someone’s need. I need to remind myself of this from time to time. We deploy wireless networks in schools, mines, warehouses, large refrigerators, outdoors; you name it, we put wifi in all kinds of places. Our ultimate goal should be to use the knowledge we have to deploy a wireless network that gives a reliable experience to the greatest number of users.

Frozen Fi – The Big TX Freeze

We all find bugs in software from time to time and most of the time they are fixed in relatively short order…..maybe….

It is always an unsettling feeling when those bugs resurface in later versions of code after being “fixed.”

We recently received reports from a school that the wifi was “not working.”  Early on in troubleshooting I could see that stations were associated to APs in the reported troublesome area.  All the normal quick checks looked ok.  There wasn’t high channel utilization, retries were low, the APs were not overwhelmed with clients, etc, etc.  I touched base with the person who reported the trouble and they informed me that the issue affected all devices and that it seemed to be in one specific area; the library.  I again checked the APs via the web ui and upon quick glance everything still looked ok.  I logged into the controller CLI and issued the station database command to see what the clients were doing specifically on the library access points.  And then it got cold……..

Station database showing the dreaded TX freeze

As you can see in the graphic above, the TX packets are few or none at all. This problem occurred a couple of years ago in later versions of System Director 6 code and was fixed with a newer release, so I had seen the problem before. I took down all of the necessary information so that I could put a ticket in with TAC, then rebooted the two APs in the library. Both of the APs affected were AP832s and both were experiencing the same problem. All other APs (AP1020s) were working fine. After the reboot the APs returned to working order.
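The symptom above (every client on an AP showing little or no TX) can be checked mechanically once the counters are collected. The following is only a sketch under stated assumptions: the `(ap_name, tx_packets)` pairs, the AP names, and the `max_tx` threshold are all hypothetical, and it does not parse real station database output.

```python
from collections import defaultdict

def frozen_aps(stations, max_tx=5):
    """Return AP names where every associated station has almost no TX packets.

    stations: iterable of (ap_name, tx_packets) pairs (hypothetical format).
    max_tx:   assumed cutoff for "few or none"; tune for your environment.
    """
    tx_by_ap = defaultdict(list)
    for ap_name, tx_packets in stations:
        tx_by_ap[ap_name].append(tx_packets)
    # An AP is suspect only if all of its stations are below the cutoff.
    return sorted(
        ap for ap, counts in tx_by_ap.items()
        if all(tx <= max_tx for tx in counts)
    )

# Hypothetical sample: two library APs with stalled TX, one healthy hallway AP.
sample = [
    ("library-1", 0), ("library-1", 2),
    ("library-2", 0),
    ("hall-1", 15230), ("hall-1", 0),
]
print(frozen_aps(sample))
```

Requiring all stations on an AP to be stalled avoids flagging a healthy AP that simply has one idle client.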

I began putting my documentation together to submit a ticket when we received word that another school was having wireless issues. This school was having trouble in its library as well. In most of our schools, the high density areas are serviced by AP832s. I logged into the CLI of that school’s controller and found the same TX freeze occurring. It became obvious to me at this point that the problem was likely affecting more than these two schools. I checked a few other controllers and found most AP832s were in a frozen TX state. I started to think about what the common denominators could be. All the APs affected were AP832s. All the controllers were running the same code, System Director 8.1-2-0.

And there it was….

I glanced at the uptime of the AP832s on the controllers that I had up. All of the AP832s had been up for 99 days. I started going through other controllers and checking AP832s that had been up for 99 days and found them all to be in a TX frozen state as well. A few I found that hadn’t reached 99 days were still operating normally. Clearly there was a bug that reared its head on the 99th day of uptime.

So you are probably asking yourself why so many of the APs had reached the 99 day uptime mark at the same time.  I was wondering the same thing at first and then it hit me.  I had done a mass upgrade of controllers over the summer to System Director 8.1-2-0 which mostly put all of the controller/AP uptimes in sync.

We launched a preemptive strike and rebooted all AP832s except for a group of three APs at one middle school which hadn’t reached 99 days yet. We knew that if we rebooted all of the APs the issue would be resolved for another 98 days or until FortiRu provided us a fix. All of the AP832s returned to working order.
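Since the freeze lines up with a fixed uptime, the reboot window can be worked out ahead of time. Here is a small sketch of that arithmetic; the AP names, uptimes, boot date, and the seven-day warning window are hypothetical, and only the 99-day mark comes from what we observed.

```python
from datetime import date, timedelta

FREEZE_DAY = 99  # uptime in days at which the TX freeze was observed

def freeze_date(boot_date):
    """Date on which an AP booted on boot_date reaches 99 days of uptime."""
    return boot_date + timedelta(days=FREEZE_DAY)

def at_risk(uptime_days, warn_days=7):
    """Names of APs within warn_days of (or already past) the freeze mark.

    uptime_days: dict of AP name -> uptime in days (hypothetical input).
    """
    return sorted(
        name for name, days in uptime_days.items()
        if days >= FREEZE_DAY - warn_days
    )

# Hypothetical uptimes for three AP832s.
uptimes = {"lib-832-1": 99, "caf-832-1": 95, "ms-832-1": 60}
print(at_risk(uptimes))
print(freeze_date(date(2016, 8, 1)))  # hypothetical boot date
```

Anything returned by `at_risk` is a candidate for a scheduled reboot outside of instructional hours.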

At that point I assembled all of my documentation and submitted a ticket to TAC.  We did some initial troubleshooting but they needed to have some APs in the frozen state to get the information they needed.  We put the ticket on hold until our test group of three APs reached 99 days uptime.  That occurred over this past weekend, November 20.  When I returned to the office I checked the APs which were at an uptime of 102 days.  All three were in a TX frozen state.

AP832s with uptime of 99 plus days in a frozen TX state

As of right now, the issue is with TAC.  It looks like we are the first to report the issue.  They have all the information that they requested; logs, diags, etc so I should be hearing back from them soon.  In the meantime, a quick reboot is just the ticket to thaw out the TX.

Here are a few commands I used to gather AP/radio data from within the AP CLI:

stadb display assigned

stadb display assigned -v <mac-addr>

stadb display rxq_info <client mac-address>

stadb display txq_info <client mac-address>

sys exec /wl -i radio0 msglevel err

sys exec /wl -i radio1 msglevel err

radio txqinfo radio0 (radio zero)

radio txqinfo radio1

radio txq radio0

radio txq radio1

radio display

radio show radio0 (radio zero) – radio specific parameters

radio show radio1 – radio specific parameters

radio stats radio0 (radio zero) – radio specific stats

radio stats radio1 – radio specific stats

dev cmd radio0 reset – resets radio without rebooting the AP

dev cmd radio1 reset – resets radio without rebooting the AP

sys exec cat /proc/meminfo

Single Channel Architecture and Virtual Cell and its Effect on Co-Channel Contention

Before you go ripping me, hear me out.

I was recently called out to a high school that had put in a trouble incident indicating that they were experiencing slow wireless performance and client disconnections in a particular area of the building. When I arrived I found a high concentration of clients in the cafeteria, around 200 in total, with 150 of those on a single Meru AP832 (clients were primarily on one side). Currently we employ AP832s (3×3 AC) in high density areas unless it is recent construction, in which case the whole school is done with the AP832. The cafeteria in this high school, along with those in all other high schools in the district, qualifies as a high density area and is outfitted accordingly. I fired up MetaGeek inSSIDer for a quick glance to make sure everything looked ok. I quickly noticed that I had two BSSIDs on the same channel very close in dBm on both bands, which we all know is cause for concern. My man Rowell Dionicio explains this issue well in a recent blog post at Network Computing. You might be thinking, “Wait a tick #MeruMitch! Isn’t that what single channel architecture is?” Not really. In this particular instance I experienced a bug where a handful of AP1020s (2×2 N) thought that one of their two radios was a radio from an AP832.

Single Meru AP1020. The 2.4 GHz radio thinks it is in an AP832, while the 5 GHz radio is correct.

This effectively broke my virtual cell (virtual BSSID) which in turn created co-channel contention. For those not in the know, a virtual cell is a virtual BSSID that Meru groups all of its physical BSSIDs behind to create its “special sauce single channel architecture”. Watch the video and… DON’T SHOOT THE MESSENGER!

At the time I did not capture any further information so for the sake of this blog post I re-created the scenario at my house.  Going forward please pretend that 5NINER-Meru = LCPS.  Also the “Shed” AP was actually moved into my house for the purpose of this dramatic re-creation.

Broken virtual cell. Two AP832s on 1/100 and one AP1020 on 1/100. The AP832s retain their virtual BSSID, but the AP1020 being on the same channel creates another virtual cell which is causing co-channel contention.

There were a total of seven AP1020s which suffered from this bug and a few of them were within hearing distance of where this high concentration of clients was located.  I checked the controller and found the channel utilization through the roof in this area.  I removed the corrupted APs from the controller and added them back.  I set all AP1020 radios back to 11/44 and AP832 radios back to 1/149.  You did read that correctly…ALL.  Both virtual cells were back in business.

Two unique properly configured virtual cells. Same APs as before. Two AP832s on 1/100 and a single AP1020 on 11/44.

Ok, now pick yourself back up off the floor and/or clean up the brain matter that splattered all over the wall just now when your head exploded. Brain matter on the wall is unbecoming. I digress. Once the physical radios were all back behind the virtual BSSIDs created by the ESS profile, the clients’ performance dramatically improved and everyone was happy again.

You may argue that this is simply a one-off issue due to the AP radio corruption that took place. In fact, it is not. It would be very easy to have a problem like this occur with the APs working as they should in an improperly designed network. To properly design a Meru single channel wireless network in a mixed model environment, you should group like-model APs in their own virtual cell. The virtual cell for a group of AP832s will have a different virtual BSSID from the virtual cell for a group of AP1020s, which in turn means each virtual cell should be assigned unique channels. They can all happily co-exist in the same ESS profile though.

Same ESS profile. Broken virtual cell causing co-channel contention (CCC).

Same ESS profile. Proper channel assignment creating two unique virtual cells.
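
The design rule above (one virtual cell per AP model, each on its own channels) is easy to sanity check against an AP inventory. The sketch below is illustrative only: the tuple layout, AP names, and the `check_channel_plan` helper are assumptions made for this example, not anything exported by the controller. The sample channels mirror the broken and proper setups from my re-creation.

```python
from collections import defaultdict

def check_channel_plan(aps):
    """Flag channel plans that would break per-model virtual cells.

    aps: list of (name, model, channel_2_4ghz, channel_5ghz) tuples
         (hypothetical inventory format).
    Returns a list of human-readable problems; empty means the plan looks sane.
    """
    plans_by_model = defaultdict(set)
    for _name, model, ch24, ch5 in aps:
        plans_by_model[model].add((ch24, ch5))

    problems = []
    # Every AP of a given model should share one channel pair.
    for model, plans in sorted(plans_by_model.items()):
        if len(plans) > 1:
            problems.append(f"{model}: mixed channel plans {sorted(plans)}")

    # Different models should not share any channel.
    models = sorted(plans_by_model)
    for i, first in enumerate(models):
        for second in models[i + 1:]:
            used_first = {ch for plan in plans_by_model[first] for ch in plan}
            used_second = {ch for plan in plans_by_model[second] for ch in plan}
            shared = used_first & used_second
            if shared:
                problems.append(
                    f"{first} and {second} share channel(s) {sorted(shared)}"
                )
    return problems

broken = [("Den", "AP832", 1, 100), ("Office", "AP832", 1, 100),
          ("Shed", "AP1020", 1, 100)]
proper = [("Den", "AP832", 1, 100), ("Office", "AP832", 1, 100),
          ("Shed", "AP1020", 11, 44)]
print(check_channel_plan(broken))
print(check_channel_plan(proper))
```

A check like this only validates the intended configuration; it would not have caught the radio corruption bug by itself, since the corrupted AP1020s were misreporting what they were.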

I understand the controversial side of single channel architecture, but the fact of the matter is that it works great when it is designed and installed CORRECTLY! All the same RF fundamentals, designing, deploying, etc are relevant when using single channel architecture. Poor marketing, among other things, gives it a bad name. Taking it out of the box, racking the equipment, throwing everything on the same channel, and letting it rip WILL yield poor performance. Counter to some marketing I have read, single channel architecture is NOT for the network group that “doesn’t have time to worry about maintaining the wireless network” or “doesn’t have time to properly survey.” That is real, folks! I have read or heard that garbage before. So please…I ask from my little single channel heart, give me and my precious FortiRu a chance. We definitely deserve a spot “in the ballpark” among the big boys.

Yours in loving single channel goodness,

#MeruMitch

PS – There will be a Whiskey and Wireless podcast in the near future where I talk about my single channel adventures.  Keep an eye on the Whiskey and Wireless Twitter feed.