Software automation has been around for many years; the cloud extends the concept to servers, networks, and firewalls. It supports rapid deployment and teardown of environments, making it useful for prototyping and sophisticated testing. Those tests include disaster recovery, fully loaded failover, and server right-sizing.
The dominant web servers by market share are Apache, NGINX, and Microsoft IIS, as shown in the following chart. Apache is the most flexible because of its large number of configuration options. NGINX is faster and often used as a proxy in front of the web server. IIS has higher running costs because of Windows licensing fees and missing operating system capabilities, such as process forking, that lead to less efficient solutions.
Linux and its distributions hold over 80% of the web server market, as shown in the following chart. The open-source model behind Linux led to a more vibrant community. That resulted in a transparent architecture, kernel tracing tools, lower resource utilization, and other capabilities that make it easier to fix and stabilize applications. However, Windows still has better support for less common software drivers. The free licensing and large install bases of Debian, CentOS, and Ubuntu make it practical to run production without support contracts. Windows and Red Hat carry licensing fees, while other systems lack the install base needed to maintain them without a support agreement.
Cloud solutions make it possible to replace a failed server on demand, reducing cost by eliminating the need for server redundancy. They also reduce costs by supporting repeated prototyping and deployment testing.
Redundancy provides multiple components with excess capacity to survive partial failures. Redundant systems are most robust when they use a share-nothing architecture, because the failure of one node has no impact on the others. Always sending user load to every node reduces the percentage of users affected by a crash, confirms that all recovery nodes are working, and reduces the changes needed to complete a failover. Each of these points increases availability, and the design also delivers faster response times.
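As a minimal sketch of the send-load-to-every-node idea, the Python below distributes requests round-robin across a hypothetical node list and skips nodes marked down; the node names and the stand-in health check are illustrative assumptions, not part of the original text.

```python
# Hypothetical node list; a real deployment would discover these.
NODES = ["web1", "web2", "web3"]

def healthy(node, down):
    """Stand-in health check: a real system would probe the node."""
    return node not in down

def route(request_id, down=frozenset()):
    """Send each request to a live node so every node carries traffic.
    Losing one node affects only a fraction of users, and the routing
    changes needed to complete failover stay small."""
    candidates = [n for n in NODES if healthy(n, down)]
    if not candidates:
        raise RuntimeError("no healthy nodes")
    # Round-robin keeps load on every node, proving each one works.
    return candidates[request_id % len(candidates)]
```

When `web2` fails, requests simply redistribute across the survivors with no configuration change.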
Adding network connections to a data center enhances availability. It simplifies network restructuring, managing router failures, and mitigating denial of service attacks. Data center connections use the multi-route Border Gateway Protocol (BGP) network. Websites have anywhere from a single BGP connection to over 100, as shown in the following chart.
Designing websites for recovery makes them less likely to crash, restores operations more quickly, and increases the probability of success. It goes beyond taking backups to test the following scenarios.
Most sites lack the encrypted communication, server hardening, and secure hosting needed to deter basic attacks. Detection of malware can take months, resulting in corrupted backups. Content Management Systems (CMS) like WordPress make it challenging to separate application code from customizations and malicious content.
Human errors, especially during deployment, are the most frequent cause of outages. Automation leads to the rapid deployment of environments, allowing for extensive prototyping and detailed testing. It is easy to check rollover to production, rollback from failure, disaster recovery, system overloading, and security hardening. Automation lets companies move human errors from production to testing environments.
The probability of server failure in the first year is 5% and increases after that. Routers and other equipment can also fail. Virtualization simplifies hardware recovery by supporting rapid replacement. Examples include replacing a server, booting the disk image into a newer server model, and replacing disks while the website is running.
Accelerated software updates result from faster changes to business requirements and more frequent security patching. Stabilizing the solution requires more careful update controls. Automated testing provides a rapid way to detect failures. Monitoring detects evolving issues and reduces the time between an outage and its detection. Embedding rollback into the design recovers from failures that were not caught in testing.
Data center outages result from internal and external issues. Planning to recover from these scenarios involves migrating operations to other data centers. Running standalone website instances in several regions offers the best protection.
Recovery testing simulates failure and validates the resumption of operations. Techniques to simulate disasters include changing network routing, shutting down a server, deleting disk content, and overloading systems. Cycling through repeated tests provides the working environment to tune the solution rapidly. Rerunning disaster recovery tests is more straightforward in the cloud due to the low cost and speed of deployments.
Logged errors are a leading indicator of outages. Addressing them in a non-crisis mode leads to higher quality solutions and reduces the noise when an outage does happen. Reviews should cover logs from operating systems, web servers, databases, routers, firewalls, and other components, looking periodically at stability, security, performance, and monitoring. If an outage does happen, there will be less clutter masking the underlying issues, so the fix will take less time.
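A periodic review can start with a simple tally of error entries per component, as in the sketch below; the bracketed-component log format is an assumption made for illustration, and real logs would need format-specific parsing.

```python
import collections
import re

def summarize_errors(log_lines):
    """Count ERROR entries per component so recurring issues can be
    addressed in non-crisis mode, before they contribute to an outage."""
    counts = collections.Counter()
    # Assumed log format: "[component] LEVEL message" (illustrative).
    pattern = re.compile(r"\[(?P<component>\w+)\]\s+ERROR")
    for line in log_lines:
        match = pattern.search(line)
        if match:
            counts[match.group("component")] += 1
    return counts

sample = [
    "[webserver] ERROR worker timeout",
    "[database] ERROR connection refused",
    "[webserver] ERROR worker timeout",
    "[firewall] INFO rule reloaded",
]
```

Sorting the resulting counts highlights which subsystem deserves attention first.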
Overload testing establishes safe limits for user loads. Tuning for an overload ensures the system recovers immediately when the load subsides. Instead of hanging, it quickly rejects new requests when there is insufficient capacity. That increases throughput by concentrating resources on successful completions. It also helps protect against Denial of Service (DoS) attacks.
Endurance testing verifies the absence of resource leaks, such as leaked memory and open files. Ideally, tests include both successful and unsuccessful requests. Leaks come from a variety of sources; for example, a program may allocate and release memory that never goes back to the operating system. Testing for this condition establishes an appropriate life span for a web server process.
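The leak pattern above can be demonstrated by comparing memory growth between a leaky and a well-behaved request handler using Python's `tracemalloc`; the handler and allocation sizes below are contrived purely for illustration.

```python
import tracemalloc

_leak = []  # simulated leak: references that are never released

def handle_request(leaky):
    data = bytearray(10_000)  # per-request working memory
    if leaky:
        _leak.append(data)    # leaked references accumulate forever

def memory_growth(requests, leaky):
    """Measure allocated-memory growth across repeated requests.
    Steady growth signals a leak and motivates restarting web server
    processes after a bounded life span."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    for _ in range(requests):
        handle_request(leaky)
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before
```

A healthy handler shows roughly flat memory over thousands of requests, while the leaky one grows linearly with request count.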
The goal of monitoring is to identify issues before they cause an outage. Monitoring also reduces the time between failure and detection. Modern designs increasingly launch recovery scripts to minimize outage duration.
Measuring availability is the most common form of monitoring. Simple checks can run frequently to reduce the time between failure and detection. More involved checks ensure all the subsystems work, but calling them too frequently increases system load. Originating calls from thousands of miles away ensures the data center is connected to the internet. More rigorous monitoring increases operational costs, which the design should keep to a minimum.
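A simple availability probe might look like the sketch below: it reports reachability and latency for a URL using only the standard library. In production the probe would run on a schedule and originate far from the data center; the throwaway local server here exists only so the sketch is self-contained.

```python
import http.server
import threading
import time
import urllib.request

def check_availability(url, timeout=5.0):
    """Probe a URL and report (is_up, latency_seconds). Frequent,
    simple checks like this shrink the failure-to-detection gap."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            ok = 200 <= response.status < 300
    except OSError:
        ok = False
    return ok, time.monotonic() - start

def serve_locally():
    """Start a throwaway local HTTP server for demonstration; a real
    probe would target the production site from a remote location."""
    server = http.server.ThreadingHTTPServer(
        ("127.0.0.1", 0), http.server.SimpleHTTPRequestHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Tracking the returned latency over time also gives an early warning of degradation before a hard failure occurs.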
System resource monitoring covers CPU, disk, and memory. Triggering alerts based on expected loads helps identify impending performance issues. Solutions that reduce costs by removing excess hardware capacity are more sensitive to load fluctuations and benefit most from this type of monitoring.
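The threshold idea can be sketched with the standard library alone, as below; the limits are illustrative assumptions, and `os.getloadavg` is Unix-only.

```python
import os
import shutil

def resource_alerts(disk_limit=0.90, load_limit=None):
    """Compare current disk usage and CPU load average against
    expected limits; breaches become alerts before users notice."""
    alerts = []
    usage = shutil.disk_usage("/")
    if usage.used / usage.total > disk_limit:
        alerts.append(f"disk {usage.used / usage.total:.0%} full")
    if load_limit is None:
        # Illustrative expectation: one runnable task per CPU core.
        load_limit = os.cpu_count()
    one_minute = os.getloadavg()[0]  # Unix-only; not on Windows
    if one_minute > load_limit:
        alerts.append(f"load average {one_minute:.2f} over {load_limit}")
    return alerts
```

A cron job or monitoring agent would run this periodically and forward any non-empty result to an alerting channel.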
Security monitoring flags infrequent administrative activity. Typical examples are administrative logins, content changes on the website, and updates to the DNS registrar. The goal is to detect uncommon and unexpected events that pose a significant risk.
Equipment monitoring covers printers, routers, and firewalls in offices and data centers, typically using the Simple Network Management Protocol (SNMP). These monitors provide more meaningful alerts and detect issues before users call the help desk.