UCPrimer

Best Practices for Skype for Business 2015 High Availability

6/30/2015

High Availability (HA) and Disaster Recovery (DR) capabilities continue to improve in Skype for Business Server 2015 (Skype4B), providing greater uptime, reliability, and performance for end users. The Front-End HA/DR capabilities are built on the Windows Fabric foundation that was introduced as v1 in Lync Server 2013 and is improved with v3 in Skype4B. For the Back-End servers, redundancy is built on SQL Mirroring or AlwaysOn technologies. This article covers some best practices for Front-End Server HA/DR in Skype4B. For readers already familiar with these concepts, it can also serve as a refresher or reference.
Ensuring Pool Quorum and Routing Group Quorum
HA in the Skype4B Front-End pool is the ability of the server pool to continue servicing users when one or more servers in the pool fail. Front-End pool HA is based on clustering, with HLB and DNS load balancing, and on logic built into the servers that automatically distributes groups of users across the Front-End servers in the pool. During startup, 85% of all servers in the pool must be started for the entire pool to be functional. This is known as achieving Routing Group Level Quorum. Once the pool is operational, only 50% of the servers must keep running for the pool to remain functional. This is called Pool Level Quorum, and it should not be confused with Routing Group Level Quorum. The table below shows the numbers for these two quorum types:
[Table: number of servers required for Routing Group Level Quorum and Pool Level Quorum, by pool size]
To help achieve quorum, we should never define more servers in the topology than are actually deployed. If the number of running servers falls below the 85% Routing Group Level Quorum but there are still enough servers to maintain Pool Level Quorum, the Reset-CsPoolRegistrarState -ResetType QuorumLossRecovery cmdlet can be used to reload user data from the backup store for any routing groups currently in quorum loss.
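As a rough illustration of the Pool Level Quorum arithmetic, the sketch below computes the smallest number of running servers that is at least 50% of the pool. Rounding up is an assumption made here, the function names are hypothetical, and the sketch ignores any extra voting member, so the table in the original post remains the authoritative source.

```python
def min_running_for_pool_quorum(pool_size: int) -> int:
    """Smallest number of running servers that is at least 50% of the pool.

    Assumption: the 50% figure is rounded up to a whole server.
    """
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    return (pool_size + 1) // 2  # integer form of ceil(pool_size / 2)


def has_pool_quorum(pool_size: int, running: int) -> bool:
    """True if enough servers are running to keep the pool functional."""
    return running >= min_running_for_pool_quorum(pool_size)


if __name__ == "__main__":
    for size in (3, 4, 5, 6):
        print(f"{size}-server pool: at least {min_running_for_pool_quorum(size)} running")
```

Under this assumption, a 5-server pool stays functional with 3 servers running but not with 2.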
Deploy at least 3 FE servers
Although a 2-node FE server pool can be deployed, the minimum recommended number of FE servers is 3. This is because of how Windows Fabric distributes users into Routing Groups, which are created automatically and increase in number as more users are added. Each Routing Group has 3 replicas of user data: Primary, Secondary, and Backup Secondary. User requests are always serviced by the Primary replica, and all users assigned to a Routing Group are always homed on the same FE server. Windows Fabric v3 performs synchronous writes to all three replicas and, with the exception of Conference State data, only periodic writes to the back-end BLOB database for rehydration purposes. To view the replica set information for a particular user, we can use the Get-CsUserPoolInfo "domain\username" cmdlet:
[Screenshot: Get-CsUserPoolInfo output for the user "jreacher"]
As shown in the screenshot above, the user "jreacher" is assigned a primary replica at lyncfe2, with lyncfe1 and lyncfe3 as secondary replicas. In an FE pool with only 2 nodes, Windows Fabric is unable to build the 3-member replica set and therefore performs synchronous writes to the back-end database, which affects overall server performance.
Recovering from Quorum Loss
Let's start by reviewing the user experience when quorum loss occurs. Since all user requests are serviced by the Primary replica, if one of the Secondary replicas goes down, there is no impact on the user at all. If the Primary replica goes down, one of the Secondary replicas is promoted to become the Primary, and the user experiences only a brief disconnect before full service is restored. If two of the replicas go down at the same time, the user is placed into "Limited Functionality" mode. If all three replicas go down, the user is disconnected and the client remains in a "Reconnecting..." state indefinitely.
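The cases above can be summarized as a small decision function. This is only a sketch of the behaviour described in this section, not of anything Windows Fabric exposes: the function name and the returned labels are hypothetical.

```python
def user_experience(primary_up: bool, secondaries_up: int) -> str:
    """Map the state of a routing group's 3 replicas to the user experience.

    primary_up: whether the server holding the Primary replica is running.
    secondaries_up: how many of the 2 Secondary replicas are running.
    """
    if not 0 <= secondaries_up <= 2:
        raise ValueError("a routing group has exactly 2 secondary replicas")

    replicas_up = int(primary_up) + secondaries_up
    if replicas_up == 3:
        return "no impact"
    if replicas_up == 2:
        # Losing a secondary is invisible; losing the primary forces a promotion.
        return "no impact" if primary_up else "brief disconnect"
    if replicas_up == 1:
        return "limited functionality"
    return "disconnected"


if __name__ == "__main__":
    print(user_experience(primary_up=True, secondaries_up=1))
    print(user_experience(primary_up=False, secondaries_up=2))
    print(user_experience(primary_up=False, secondaries_up=1))
    print(user_experience(primary_up=False, secondaries_up=0))
```

The "limited functionality" and "disconnected" cases are the ones that correspond to quorum loss for the routing group.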

To recover from quorum loss, the administrator needs to run the Reset-CsPoolRegistrarState -ResetType QuorumLossRecovery cmdlet to tell Windows Fabric to rebuild the replicas on the remaining FE servers, assuming there are enough of them to maintain pool quorum. This cmdlet reloads user data from the backup store for any routing groups currently in quorum loss, but any data not yet written to the database could be lost in the process.

It is worth noting that the "FullReset" option for this cmdlet performs the same type of reset as "QuorumLossRecovery" but, in addition, rebuilds the local Skype for Business Server 2015 databases. This type of reset can be long and resource-intensive, and is typically used only when changing a topology from a pool with a single Front End server to a pool with multiple Front End servers. Using the "FullReset" value when attempting to restart a pool will sometimes fail, and the pool will not actually restart. Hence, never run Reset-CsPoolRegistrarState -ResetType FullReset without assistance from Microsoft Support.
Avoid deploying an even number of FE servers when using SQL Mirroring
When determining the number of servers running in a pool, Windows Fabric uses a voting process that requires an odd number of voters. In a pool with an even number of FE servers, for example a 4-server FE pool, the SQL back-end server is added as a voting member to make the count odd. When SQL Mirroring is used, only the primary node in the SQL Mirror is assigned as a voter. This means that if the SQL Mirror fails over to the secondary node, the pool already has one less vote. If two FE servers subsequently go down, there will be Pool Quorum Loss and the entire pool will go down, even though technically 50% of the FE servers are still up and running. Hence, it's recommended to deploy an odd number of FE servers in a pool when SQL Mirroring is used for back-end database HA.
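The vote counting in this scenario can be worked through with a short sketch. It assumes a simple majority-of-voters rule and assumes the SQL primary is counted as a voter only when the number of FE servers is even, as the paragraph above describes. The function names are hypothetical, and the real Windows Fabric logic may differ in detail.

```python
def total_voters(fe_servers: int) -> int:
    """FE servers, plus the SQL primary when the FE count is even."""
    return fe_servers + (1 if fe_servers % 2 == 0 else 0)


def pool_has_quorum(fe_servers: int, fe_running: int, sql_primary_up: bool) -> bool:
    """True if the votes still present form a strict majority of all voters."""
    votes = fe_running
    if fe_servers % 2 == 0 and sql_primary_up:
        votes += 1  # the SQL primary's vote only exists for even-sized pools
    return 2 * votes > total_voters(fe_servers)


if __name__ == "__main__":
    # 4 FE servers, 2 down, SQL primary still voting: 3 of 5 votes.
    print(pool_has_quorum(4, 2, sql_primary_up=True))
    # Same pool after the SQL Mirror failed over: 2 of 5 votes.
    print(pool_has_quorum(4, 2, sql_primary_up=False))
    # 5 FE servers, 2 down: 3 of 5 votes, with no dependency on SQL.
    print(pool_has_quorum(5, 2, sql_primary_up=False))
```

Under these assumptions, the 4-server pool loses quorum with half of its FE servers still running once the SQL vote is gone, while the 5-server pool tolerates the same two FE failures regardless of the SQL Mirror state.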
Virtualizing Servers in a FE Pool
While Skype4B servers support virtualization across all workloads, including instant messaging (IM) and presence, conferencing, Enterprise Voice, Monitoring, Archiving, and Persistent Chat, proper placement of servers is vital to ensuring that HA works properly when servers fail. Since Windows Fabric always creates 3 replicas for every Routing Group, and all users are served by the primary replica of their Routing Group, it makes no sense to place two or more replicas of the same Routing Group on the same virtual host. If that host goes down, the Routing Group immediately loses quorum and its users experience downtime, even though Pool Quorum may not be lost. Take, for example, the 5-node FE pool with servers virtualized according to the diagram below:
[Diagram: a 5-node FE pool virtualized across Hyper-V hosts]
In the above diagram, if we lose Hyper-V Host A due to hardware failure, all users belonging to Routing Group 1 will experience Routing Group Quorum loss and be placed into Limited Functionality mode, while users in Routing Groups 2 and 3 are unaffected since Pool Quorum is still maintained. Therefore, the best practice is to place only a single FE node on each Hyper-V host.
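A placement like this can be checked on paper before deployment. The sketch below takes a hypothetical layout (the server names, host names, and replica assignments are invented for illustration and are not taken from the original diagram) and lists the Routing Groups that would lose their replica majority if a given host failed.

```python
def groups_losing_quorum(host_of, replicas_of, failed_host):
    """Return the routing groups left without a majority of their replicas.

    host_of: maps each FE server name to the virtual host it runs on.
    replicas_of: maps each routing group to the FE servers holding its replicas.
    """
    affected = []
    for group, servers in replicas_of.items():
        surviving = [s for s in servers if host_of[s] != failed_host]
        if 2 * len(surviving) <= len(servers):  # no strict majority left
            affected.append(group)
    return affected


# Hypothetical layout: two FE servers share Host A, the rest have their own host.
HOST_OF = {"fe1": "A", "fe2": "A", "fe3": "B", "fe4": "C", "fe5": "D"}
REPLICAS_OF = {
    "RG1": ["fe1", "fe2", "fe3"],  # two replicas on Host A
    "RG2": ["fe3", "fe4", "fe5"],  # no replica on Host A
    "RG3": ["fe2", "fe4", "fe5"],  # one replica on Host A
}

if __name__ == "__main__":
    print(groups_losing_quorum(HOST_OF, REPLICAS_OF, failed_host="A"))
```

In this layout only RG1 is affected by losing Host A. Since Windows Fabric decides replica placement rather than the administrator, keeping one FE node per host is the only way to guarantee that no host ever carries two replicas of the same group.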
5 Comments
Jindeep
6/19/2016 06:41:00 am

Hi,
I have placed 3 Skype for Business Front End Enterprise servers in one pool. While checking high availability functionality, the user got logged off as soon as we took one FE server down, and it takes around 30-50 seconds for the SfB client to reconnect. Is this normal behaviour?

Brennon
6/27/2016 12:43:49 am

Yes this is normal and happens when the FE server that you put down happens to be the primary replica for the user. The SfB client should reconnect automatically after 30sec or so.

carl
7/18/2017 10:21:35 pm

from my experience, if you want to bring a frontend down for maintenance etc bring it down with the graceful switch.

stop-cswindowsserver -graceful

this should bleed users over to different frontends cleanly (sometimes users will experience an automatic 1-2 second disconnect/reconnect, but not all the time).

carl
7/18/2017 10:23:02 pm

sorry should be
Stop-CsWindowsService -Graceful

Brennon
8/1/2017 02:42:41 am

Hi Carl

Yes thanks for pointing this out.

