The Definitive Guide to Cloud Management, Part Two: Special Cases - Hybrid Cloud Management and Automation | Morpheus

The Definitive Guide to Cloud Management, Part Two: Special Cases

Table of Contents

One unified network, one seamless, intuitive interface

Multiclouds bring out the best in cloud management platforms

Cloud services’ role as foundation for advanced analytics

Advanced cloud functionality, part one: Self-monitoring

User demands and cloud complexity drive the need for self-monitoring

Advanced cloud functionality, part two: Explicit notification

Advanced cloud functionality, part three: Failover/self-healing

Seamlessly integrate resources via cloud service APIs

Four tips for smooth integration of third-party APIs

Cloud capacity management, part one: Dynamic resource scaling anticipates demand

Cloud capacity management, part two: Use scaling to optimize app performance

Requirements of a dynamic resource allocation framework

IT’s role as conductor of the cloud-app orchestra

‘Cloud management’ and ‘information management’ are becoming indistinguishable

Table of Illustrations

Figure 1: In a typical multicloud scenario, collections of services are mixed and matched to create applications on the fly, a fluid process that replaces “monolithic applications.”

Figure 2: Cloud management platforms simplify multicloud integration by abstracting the interface to public and private clouds, as well as to internal systems and third-party services/APIs.

Figure 3: The principal advantages of cloud analytics over on-premises BI are upfront costs and implementation time, while the primary disadvantages are customization and control of data security standards.

Figure 4: Descriptive analytics tell you what happened and why; predictive analytics tell you what might happen; and prescriptive analytics tell you what you should do about it.

Figure 5: Application performance monitoring that includes real user monitoring or deep-dive monitoring of container resources requires an agent-based approach, whether active or passive.

Figure 6: A reference architecture for hybrid cloud management integrates multiple external data sources, bundling governance and security around core resource, financial and other services.

Figure 7: Pinging offers advantages over TTL for reporting the state of microservices because the checks remain outside of the service itself, thus reducing complexity.

Figure 8: The skyrocketing demand for cloud API integration is driven by growth in cloud services for CRM, marketing, finance, and e-commerce.

Figure 9: This model of cloud-based API management separates the responsibilities of the web API gateway and the API management functions from the enterprise’s internal architecture.

Figure 10: AWS Auto Scaling launches new instances in the Availability Zone with the fewest instances, or in a new zone when the first launch attempt fails. If there are multiple subnets in a VPC Availability Zone, a subnet from the zone is selected at random.

Figure 11: Autoscaling configures services so they automatically provision more instances whenever the defined capacity limit is reached, but in the brief time before autoscaling kicks in, the service is throttled to ensure it remains available.

Figure 12: In eight years of annual IT surveys, cloud computing went from zero mentions to being the highest-paying functional area, second among top interests, and third among the most difficult hiring areas.

For enterprises making the transition from on-premises IT to cloud services, the process can seem like converting an ocean liner into a wide-body jet while cruising at full speed in the middle of the Pacific. Forward-looking CIOs see the opportunities as well as the challenges. After all, how often do you get the chance to reimagine your company’s information-management strategy?

Considering the ever-increasing value of your organization’s data resources, the potential payoffs of the cloud transition are unprecedented in the annals of IT. Software-defined networks, continuous integration/continuous delivery, and other innovations made possible by cloud computing promise to increase worker productivity, primarily by making data processing more secure and more efficient.

All that is required to realize this ideal is the total transformation of your data networks, a process that for most companies is well underway. The organizations most likely to come out on top are those that approach the cloud as a “carpe diem” moment: Seize the opportunity to deliver a single intuitive interface to employees regardless of platform and to position the IT department as keeper and protector of all the firm’s data resources.

One unified network, one seamless, intuitive interface

In the multicloud infrastructures that predominate in today’s firms, interfaces are abstracted into a separate layer. That’s the only way to keep users from going nuts as they switch between apps and networks repeatedly in the course of a typical business day. GigaOM‘s David S. Linthicum describes the change in IT focus away from “monolithic applications” and toward a “common services catalog” that links back to a diverse range of cloud types.

Figure 1: In a typical multicloud scenario, collections of services are mixed and matched to create applications on the fly, a fluid process that replaces “monolithic applications.” Source: GigaOM

Cloud offerings thus become components of a wide-ranging services catalog that supports the assembly of applications created, changed, and updated nearly on demand. The more cloud options in your app-component portfolio, the easier it is to realize the value of the agility the cloud makes possible. At the same time, the more cloud services you offer, the more complex the management and deployment tasks.

Many organizations address the added complexity of a multicloud infrastructure by taking advantage of the native interfaces and consoles of cloud providers such as AWS, Google, and Rackspace. Linthicum points out the many shortcomings of this approach, which he calls “provider native governance and management.” First off, these interfaces don’t scale, so they are operationally ineffective over the long haul. Even more troublesome for companies is the increased complexity of managing multiple consoles and the inability to automate management tasks.

Multiclouds bring out the best in cloud management platforms

Some companies choose to apply multicloud governance for tracking, security, and management by integrating policies at the service and API levels. Doing so can complicate management, particularly as the number of services and APIs grows. A simpler alternative is to use a cloud management platform that provides actual cloud resources (compute, storage, and databases) rather than interfaces to the resources via services or APIs.

The single pane of glass that CMPs offer puts power in the hands of the people who use and manage all aspects of the multicloud. These platforms create an abstraction layer that connects users and managers with all public/private cloud and on-premises systems. The single pane also functions as a lingua franca for automation services such as Chef and Puppet.

Figure 2: Cloud management platforms simplify multicloud integration by abstracting the interface to public and private clouds, as well as to internal systems and third-party services/APIs. Source: GigaOM

The policy-based approaches used to automate management of various cloud resources can be applied to many back-end cloud technologies. This facilitates provisioning across heterogeneous network components, such as separate clouds for databases, compute resources, and storage. Another benefit of a CMP is the ability to place an abstraction layer between enterprise IT and the multicloud resources under IT’s control.
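
As a rough illustration of policy-based placement, a CMP-style rule set can be modeled as an ordered list of predicates. The policy rules and cloud names below are hypothetical, not taken from any particular platform:

```python
# Minimal sketch of policy-based placement: each rule maps a workload
# attribute to a back-end cloud, keeping the provisioning logic declarative.
# All predicates and cloud targets here are illustrative.

PLACEMENT_POLICIES = [
    (lambda w: w["type"] == "database", "private-cloud"),
    (lambda w: w.get("gpu"), "public-cloud-gpu"),
    (lambda w: True, "public-cloud-general"),  # catch-all default rule
]

def place(workload):
    """Return the first cloud whose policy predicate matches the workload."""
    for predicate, cloud in PLACEMENT_POLICIES:
        if predicate(workload):
            return cloud

print(place({"type": "database"}))          # private-cloud
print(place({"type": "web", "gpu": True}))  # public-cloud-gpu
```

Because the rules live in data rather than in per-cloud console logic, adding a new back-end cloud means adding a rule, not rewriting the provisioning code.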

Cloud services’ role as foundation for advanced analytics

Whether it’s called real-time stream processing, streaming analytics, predictive analytics, prescriptive analytics, or some other catchy phrase, the goal remains the same: Contextualize and otherwise make sense of data at (or near) the time and place the data is initially collected. According to a study conducted by BARC Research and Eckerson Group entitled BI and Data Management in the Cloud: Issues and Trends 2017, the percentage of companies using cloud business intelligence solutions has increased from 29 percent in 2014 to 43 percent in 2017.

According to the Deloitte, EMA and Informatica State of Cloud Analytics Report, 70.1 percent of organizations view the cloud as “essential” to their analytics strategy, and 21.6 percent state that cloud is a “necessary” component of their analytics operations. The three technical factors having the greatest impact on the growth of cloud analytics, according to the Deloitte report, are lower costs (19.5 percent), data security (15.1 percent), and collaboration (14.4 percent).

Even more telling, just under half of the executives surveyed by Dresner Advisory Services for its 2017 Cloud Computing and Business Intelligence Market Study consider cloud BI as critical or very important to their organizations’ information requirements. The study identifies the four industries that rely most heavily on cloud BI: financial services, higher education, business services, and retail.

Figure 3: The principal advantages of cloud analytics over on-premises BI are upfront costs and implementation time, while the primary disadvantages are customization and control of data security standards. Source: Datapine, via BI+Analytics Conference

The field of big data analytics wouldn’t be possible without the cloud. Yet what we currently refer to as “big data” is about to be dwarfed by the tidal wave of information to be generated by the internet of things. In the Wall Street Journal’s CMO Today, Deloitte Analytics advisor Tom Davenport describes the “analytics of things” as primarily descriptive (reporting or business intelligence) rather than predictive or prescriptive (recommending specific actions).

Davenport identifies five categories of IoT analytics:

1. Descriptive analytics are usually presented in graphic forms, such as bar and line charts.

2. Diagnostic analytics use statistical modeling to identify variables and relationships in data.

3. Predictive analytics are most noteworthy at present for predictive maintenance to anticipate and address problems before they occur. However, their greatest value may ultimately be in predicting accurately when things will run as expected (minimizing risk).

4. Prescriptive analytics add the ability to select the best course of action to take in response to current or future conditions.

5. Automated analytics remove the human element from the decision-making process. As you would expect, automatic decisions are appropriate only when the best course of action is obvious. Examples include streetlight patterns based on traffic data, and automatic injection of certain drugs based on data received from medical devices.

Figure 4: Descriptive analytics tell you what happened and why; predictive analytics tell you what might happen; and prescriptive analytics tell you what you should do about it. Source: Gurobi

Advanced cloud functionality, part one: Self-monitoring

The promise of artificial intelligence is to complete work tasks more efficiently and more reliably than humans. AI has been applied to the management of cloud-based data assets — at least theoretically. In an April 7, 2017, article on DatacenterDynamics, ZeroStack executive Steve Garrison outlines the requirements for a “self-driving” cloud.

Step one is to automate installation and configuration of servers, storage, and network components via a software-defined network. All software is pre-installed and “baked into” the operating-system image. After imaging and powering up a handful of servers, the cloud comes online without admins participating — or even knowing about it.

Subsequent steps cover integration with internal systems and other clouds; self-service application deployment; real-time monitoring of events and performance numbers for logging and auditing; self-monitoring and self-healing; machine learning to enable long-term decision making; and automatic environment upgrades.

As you might expect, achieving any of these steps in isolation is a tremendous challenge at present, so attempting to bundle them all into a unified, smooth-running process is a singular act of courage by any IT operation. A more practical approach to realizing the ideal of fully automated cloud management is to start with relatively simple self-monitoring functions.

User demands and cloud complexity drive the need for self-monitoring

Two trends are converging to make self-monitoring a necessity for cloud environments: rising user expectations concerning the availability and functionality of the apps they rely on; and increasing complexity of the modern multicloud infrastructure. Mehdi Daoudi states in a July 27, 2017, post on BetaNews that digital experience monitoring (DEM) has become the “ultimate metric.”

Aspects of DEM include real user monitoring (as opposed to “synthetic” or simulated monitoring); geography and network bandwidth; monitoring the “entire delivery chain,” including APIs, social media plug-ins, and other third-party services; and most importantly, keeping a close watch on the performance of cloud service providers, which is the area IT has the least control over.

Figure 5: Application performance monitoring that includes real user monitoring or deep-dive monitoring of container resources requires an agent-based approach, whether active or passive. Source: Computer Measurement Group

The key to monitoring cloud services is having an iron-clad service level agreement. In an August 9, 2017, article, MSPmentor’s Christopher Tozzi identifies four technologies that help organizations confirm that the terms of their cloud services SLAs are being met:

1. Run apps inside Docker containers, which scale easily and are resilient to hardware failures by distributing their services across multiple host servers.

2. Choose serverless computing services that offer “virtually unlimited compute resources” to your applications, even when the rest of the host infrastructure is reaching its power limits.

3. Include incident management tools in your monitoring to ensure timely notification of the parties charged with responding to performance glitches.

4. Make sure the cloud services’ SLAs contain explicit anti-DDOS strategies. This may entail a multicloud approach to minimize the impact of a distributed denial-of-service attack.

Advanced cloud functionality, part two: Explicit notification

In the July 2017 publication Practical Guide to Cloud Management Platforms (pdf), the Cloud Standards Customer Council defines the challenge facing IT operations: “[T]he number of data points needed to gain visibility and the variety of systems used to collect the data.” To meet this challenge, a CMP must offer “a simplified management view through its functionality and the aggregation and integration of data from the multiple cloud environments.”

As the old saying goes, easier said than done. In the area of infrastructure monitoring, CMPs need to make visible “operational data to support SLA management, security alerts, threat monitoring,” according to the CSCC report. In monitoring performance, CMPs must “detect increased latency and identify the source of degradation.”

Figure 6: A reference architecture for hybrid cloud management integrates multiple external data sources, bundling governance and security around core resource, financial and other services. Source: Cloud Standards Customer Council (pdf)

Key to any CMP’s integration is offering “a single pane of glass view of all your subscribed cloud services and deployed cloud workloads within your current and target public, community, and private clouds.” Two other important considerations are whether the CMP requires a software agent or is “agentless” (many services support both agent and agentless architectures); and whether the CMP’s APIs allow its services to be extended in such areas as cloud instance management, user administration, and logging and reporting.

Advanced cloud functionality, part three: Failover/self-healing

A self-healing system requires the ability to determine its current state, to compare the current state to an ideal or goal state, and to make the decisions that will achieve the goal state without requiring any human intervention. The process entails continuous checks and the ability to determine, sustain, and, when necessary, restore the optimal or designated state.

In a chapter from the DevOps 2.0 Toolkit, Viktor Farcic describes three types of software self-healing: application level, system level, and hardware level. System level healing is distinguished from application level by applying to all applications and services rather than depending on a programming language and design patterns that are applied internally. The two common problems in system-level healing are process failures and slow response times.

The standard response to a process failure is to redeploy the service or restart the process. However, determining the cause of the failure usually requires human intervention. When response times fall outside preset thresholds, you either have to scale or descale, depending on whether upper or lower response-time limits are involved. Whether the problem is a process failure or response issue, the self-healing system must be able to recognize the triggering event and respond correctly based on steps required to return the system automatically to the desired state.
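
The check-compare-act cycle described above can be sketched in a few lines; the thresholds and helper callbacks are illustrative stand-ins for real monitoring and orchestration hooks:

```python
# Sketch of one self-healing pass: observe current state, compare it with
# the desired state, and act to converge. restart/scale_up stand in for
# real orchestration calls.

DESIRED_REPLICAS = 3
MAX_RESPONSE_MS = 200

def reconcile(running_replicas, avg_response_ms, restart, scale_up):
    """Return the corrective action (if any) for one reconciliation pass."""
    if running_replicas < DESIRED_REPLICAS:
        restart()                  # process failure: redeploy or restart
        return "restart"
    if avg_response_ms > MAX_RESPONSE_MS:
        scale_up()                 # slow responses: add capacity
        return "scale-up"
    return "healthy"

actions = []
print(reconcile(2, 120, lambda: actions.append("restart"),
                lambda: actions.append("scale-up")))  # restart
```

A real system would run this loop continuously and, as noted above, still escalate to a human when the root cause of repeated failures is unclear.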

With time-to-live (TTL) checks, a service or application is programmed to report its state periodically to confirm it is working as expected, and the monitoring system records the last reported state of each app or service. If the state isn’t updated within the predefined period, the monitoring system assumes the app or service has failed and needs to be restored to the designated state. Because microservices are designed to have a clear function and a single purpose, however, adding TTL reporting to the service itself violates that single-purpose principle.

An external alternative is to ping the microservice periodically to confirm that it is functioning as expected. For services that expose an HTTP API, a simple request suffices: any response status in the 2XX range indicates a healthy service. Services without an HTTP API can be pinged by a script or another method that can validate their state. Pinging’s advantages over TTL are less repetition, less coupling, and lower complexity.
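
A ping-style check of this kind can be sketched with Python’s standard library; the demo service and its /health endpoint below are illustrative:

```python
# External health check by pinging: the monitor stays outside the service,
# so the microservice needs no self-reporting (TTL) code of its own.
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def ping(url, timeout=2):
    """Return True when the service answers with a 2xx HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False

# Demo service exposing a /health endpoint on an ephemeral port.
class DemoService(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
    def log_message(self, *args):      # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), DemoService)
threading.Thread(target=server.serve_forever, daemon=True).start()
healthy = ping(f"http://127.0.0.1:{server.server_port}/health")
print(healthy)  # True while the service is up
server.shutdown()
```

A monitoring system would schedule this check on an interval and trigger the self-healing response when it returns False.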

Figure 7: Pinging offers advantages over TTL for reporting the state of microservices because the checks remain outside of the service itself, thus reducing complexity. Source: Viktor Farcic, via Technology Conversations

Seamlessly integrate resources via cloud service APIs

The tremendous increase in the number of APIs — each with its unique set of features and functions — creates a major challenge for application developers. Products and services need to communicate with each other seamlessly, yet each API presents developers with its own integration model and data structures, including resource definitions, data model schema, error handling, and paging structures. How do you consolidate them all into a single, unified, efficient data model?

Cloud Elements’ 2017 report on the State of API integration (registration required) concludes that, from an enterprise integration perspective, there is great variation in how specific APIs are being used. While cloud storage continues to be the principal driver of API growth, use of APIs for CRM, marketing, and finance increased substantially in the last half of 2016.

Figure 8: The skyrocketing demand for cloud API integration is driven by growth in cloud services for CRM, marketing, finance, and e-commerce. Source: Cloud Elements

The growing number of public and private APIs heightens the need for metadata that describes the APIs to facilitate discovery. Metadata discovery allows data models and resource structures to be accessed and understood programmatically. CRM and marketing are the areas most likely to support discovery metadata because they have the greatest need to accommodate custom data. Discovery interfaces for APIs benefit even static data models because they make it much easier to manipulate data, particularly when transforming data from one source to another.

Event handling is also affected by the proliferation of cloud APIs. For many enterprises and developers, event handling is the most valuable information they collect about their applications. Webhooks are developers’ preferred event-handling method, yet only 29 percent of APIs support webhooks, according to Cloud Elements.

With webhooks, new event data is posted automatically to a user-defined URL monitored by the linked application. The app is updated with the new data as soon as it posts. This is much more efficient than updates via polling, which research by Zapier indicates is successful with fewer than 2 percent of requests.

Four tips for smooth integration of third-party APIs

As mentioned above, CRM is an area that entails establishing links with many custom APIs. In a June 10, 2017, post on Medium, InSite’s Damon Swayn shares four valuable lessons the company learned from years of integrating existing CRMs with the company’s platform:

1. The time and effort you invest in documenting your APIs are well spent because at some point your customers/partners will ask you to give access to a third party. If bad design or documentation prevents easy access, they will turn to one of your competitors.

2. If you build APIs based solely on what your own apps need, you’re liable to expose more internal data to third parties than you intend to. An example is a CRM app that used an event stream API to retrieve notes and comments about a contact. The call returned far more data than expected, forcing the customer to sift through records it didn’t need.

Figure 9: This model of cloud-based API management separates the responsibilities of the web API gateway and the API management functions from the enterprise’s internal architecture. Source: IBM

3. Polling has serious shortcomings when you need real-time data integration. The polling feature built into some APIs allows you to filter results so only those that have changed since your last access are returned. If no such feature is present, you have to implement scheduled polling, which requires storing and tracking the data returned by the API and then filtering out the data that hasn’t changed.
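
Scheduled polling with change filtering can be sketched as a diff against the last snapshot; fetch_records is a stand-in for the real API call:

```python
# Scheduled polling when the API offers no "changed since" filter: keep the
# last-seen snapshot and diff each poll, so only changed records move on.

def poll_changes(fetch_records, last_seen):
    """Run one polling cycle; return (changed_records, new_snapshot)."""
    current = {r["id"]: r for r in fetch_records()}
    changed = [r for rid, r in current.items() if last_seen.get(rid) != r]
    return changed, current

records = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
changed, snapshot = poll_changes(lambda: records, {})
print(len(changed))  # 2 -- everything is new on the first poll

records = [{"id": 1, "name": "Ada Lovelace"}, {"id": 2, "name": "Grace"}]
changed, snapshot = poll_changes(lambda: records, snapshot)
print(changed)  # only record 1, which actually changed
```

Note the cost this sketch makes visible: every cycle refetches and stores the full result set just to discover what changed, which is exactly the overhead webhooks avoid.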

A better option is the use of RESTful webhooks that activate each time a record is created or updated. An HTTP callback URL can be used to receive a POST request when the callback is triggered. This is about as close to real time as API integration can achieve. Webhooks require less logic to support integration: a webhook message is a single outbound HTTP POST request, and the third party needs only a receiver URL programmed to unpack and handle the POST data.
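
A minimal webhook receiver of the kind described can be sketched with Python’s standard library; the /webhook path and payload fields are illustrative:

```python
# Minimal webhook receiver: a callback URL whose handler unpacks the JSON
# body of each POST and hands the event to application logic.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # events handed off to application logic

class WebhookReceiver(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        received.append(json.loads(self.rfile.read(length)))
        self.send_response(204)        # acknowledge quickly, no body
        self.end_headers()
    def log_message(self, *args):
        pass

server = HTTPServer(("127.0.0.1", 0), WebhookReceiver)
url = f"http://127.0.0.1:{server.server_port}/webhook"

# Simulate the provider firing the webhook on a record update.
body = json.dumps({"event": "record.updated", "id": 42}).encode()
req = urllib.request.Request(url, data=body,
                             headers={"Content-Type": "application/json"})
threading.Thread(target=urllib.request.urlopen, args=(req,), daemon=True).start()
server.handle_request()               # process exactly one delivery
print(received[0]["event"])           # record.updated
```

As the text notes, the third party needs little more than this: one URL, one handler that unpacks the POST data.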

4. API integrations are basically distributed systems, and as such, they are a challenge to debug. It helps to log before and after key points in the integration code to verify pre- and post-conditions, similar to assertions. This helps pinpoint the source of problems and allows the data to be inspected before and after the method call, so you get a look at what happened around the time of the glitch.
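
The before-and-after logging idea can be sketched as a decorator; the function and logger names are illustrative:

```python
# Log arguments going into and results coming out of key integration calls,
# assertion-style, so a failure can be localized and the data inspected.
import functools
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("integration")

def logged(fn):
    """Wrap an API call with before/after log lines."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        log.info("before %s args=%r kwargs=%r", fn.__name__, args, kwargs)
        result = fn(*args, **kwargs)
        log.info("after %s result=%r", fn.__name__, result)
        return result
    return wrapper

@logged
def fetch_contact(contact_id):        # stand-in for a real API call
    return {"id": contact_id, "name": "example"}

contact = fetch_contact(7)
print(contact)
```

When a glitch occurs, the paired log lines show the last call that entered but never logged its "after" line, pinpointing where the distributed interaction broke.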

When you’re approaching your API rate limits, it’s usually caused by one of three things: a bad integration design; a rate limit too low for the way you’re using the API; or a code error causing the API to be called more often than necessary. Remote API call logging lets you aggregate information about how often API calls are made, and where the calls go.
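
That aggregation can be sketched as a counter keyed by endpoint; the endpoint names are illustrative:

```python
# Aggregate remote API calls per endpoint to spot rate-limit risk: the
# counter shows how often calls are made and where they go.
from collections import Counter

call_counts = Counter()

def record_call(endpoint):
    """Tally one outbound API call against its endpoint."""
    call_counts[endpoint] += 1

for _ in range(3):
    record_call("GET /contacts")
record_call("POST /notes")

print(call_counts.most_common(1))  # [('GET /contacts', 3)]
```

An endpoint whose count grows far faster than the business activity behind it is the usual signature of the code-error case above.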

Cloud capacity management, part one: Dynamic resource scaling anticipates demand

Use only the resources you need. That’s the promise of cloud efficiencies in a nutshell. When your applications need more compute, storage, or network assets, the resources are available on demand. Conversely, when the apps need fewer resources, the cloud automatically dials back the allocation of processor, memory, and other components.

The reality of cloud efficiencies is somewhat south of this autoscaling ideal. App developers build scaling into the programs via rules based on “common utilization metrics,” as Mike Pfeiffer writes in a July 2017 article in TechTarget. Manual scaling is available “with the click of a button,” according to Pfeiffer. However, not all applications are suited to automatic scaling, and cloud capacity-management tools rarely have an equivalent for scaling on-premises apps.

Virtualization makes it easy for administrators to add memory, storage, or CPU resources to existing virtual machines, whether they run in the data center or on the cloud. The trick is to take the next step and define rules that automate scaling resources up and down as demand dictates. All major cloud platforms support autoscaling, but their end results vary considerably based on the nature of the workloads involved.
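
A threshold rule of the kind such platforms apply can be sketched as follows; the utilization thresholds and instance limits are illustrative, not any provider’s defaults:

```python
# Threshold-based autoscaling rule: scale out when average CPU utilization
# stays above the high-water mark, scale in when it falls below the low one.

SCALE_OUT_ABOVE = 70.0   # percent CPU, illustrative threshold
SCALE_IN_BELOW = 30.0

def scaling_decision(cpu_samples, instances, min_instances=1, max_instances=10):
    """Return the new instance count given recent CPU utilization samples."""
    avg = sum(cpu_samples) / len(cpu_samples)
    if avg > SCALE_OUT_ABOVE and instances < max_instances:
        return instances + 1
    if avg < SCALE_IN_BELOW and instances > min_instances:
        return instances - 1
    return instances

print(scaling_decision([85, 90, 75], instances=2))  # 3 (scale out)
print(scaling_decision([10, 15, 20], instances=2))  # 1 (scale in)
```

Real platforms add cooldown periods and step sizes on top of rules like this, which is one reason results vary so much by workload.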

Figure 10: AWS Auto Scaling launches new instances in the Availability Zone with the fewest instances, or in a new zone when the first launch attempt fails. If there are multiple subnets in a VPC Availability Zone, a subnet from the zone is selected at random. Source: AWS Documentation

For example, many business applications have long lifespans and inflexible workloads that make them ill-suited to app autoscaling. Conversely, stateless servers whose data persists outside of the autoscaling group are much better candidates for realizing autoscaling efficiencies. Also, apps that require new servers to scale up on demand can be thwarted by slow server boot times.

Cloud capacity management, part two: Use scaling to optimize app performance

It has been five years since Marc Andreessen wrote in a Wall Street Journal essay (subscription required) that “software is eating the world.” Still, software-defined infrastructure projects continue to focus primarily on capacity: allocating network bandwidth so that applications execute on demand with no fear of running out of compute and storage resources.

In a January 5, 2017, article in InfoWorld, Cheng Wu explains that the area overlooked when planning and implementing cloud autoscaling is performance. The increasing complexity of cloud apps and services causes runtime resource synchronization to slow overall performance while leaving capacity unused. According to Wu, what’s needed is “intelligent resource execution.”

Figure 11: Autoscaling configures services so they automatically provision more instances whenever the defined capacity limit is reached, but in the brief time before autoscaling kicks in, the service is throttled to ensure it remains available. Source: Tarun Kumar Sukhu, via Computer.org

Autoscaling techniques must do more than simply determine whether storage capacity is being provisioned cost-effectively, or whether sufficient bandwidth is allocated to accommodate surges in inter-VM or inter-cloud traffic. Important cloud performance metrics include the app’s ability to use parallel execution in CPU cores, networks, and I/O capacity.

In a typical scenario, an application uses an infrastructure API to specify the correct resource types and SLA, but the API is tied to a single cloud platform, which defeats the goal of inter-cloud transportability. Many cloud apps rely on multithreading and vCPU pinning to take advantage of scaling across multicore CPUs. This approach hinders performance by introducing kernel contention, layers of abstraction, and other inefficiencies.

Some cloud apps are designed to leverage software-defined virtual devices, which bypass the OS kernel and reduce hypervisor and container overhead. However, the resulting application- and OS-level resource contention becomes so complex that infrastructure provisioning can no longer be differentiated solely by an SLA metric. What’s needed is a way to notify the cloud infrastructure and OS that resource contention is hindering the app’s performance. Once that’s in place, the app specifies the resources required to alleviate the performance bottlenecks.

Requirements of a dynamic resource allocation framework

The method used by an application to execute process threads relates directly to the source of resource contention. Machine learning techniques can be used to analyze the data path and pinpoint bottlenecks encountered by vital software threads. Once points of contention are identified, new resources are provisioned automatically, without requiring any action by the application.

Such a dynamic resource allocation scenario requires adherence to standards, as well as use of software-defined networking and software-defined storage. An example of a resource abstraction layer followed by a dynamic resource resolution layer is a Linux app’s sockets serving as the resource abstraction for a logical network connection. The kernel or hypervisor dynamically resolves the network path and network address that need to be translated.

Cloud apps running on multicore processors need an application-specific, on-demand resource allocation framework that offers per-process or per-thread granularity. This is the only way to orchestrate per thread resource provisioning that distinguishes important “elephant” threads from less-important “mouse” threads. It also allows scarce infrastructure resources to be allocated to the elephant threads having the greatest impact on performance.
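
The elephant-versus-mouse idea can be sketched as proportional share allocation; the thread names, measurements, and share budget are illustrative:

```python
# Per-thread resource weighting: measure each thread's busy time and give
# heavy "elephant" threads proportionally more of a fixed resource budget.

def allocate_shares(busy_ms_by_thread, total_shares=100):
    """Split a share budget across threads in proportion to measured load."""
    total_busy = sum(busy_ms_by_thread.values())
    return {
        name: round(total_shares * busy / total_busy)
        for name, busy in busy_ms_by_thread.items()
    }

shares = allocate_shares({"ingest": 900, "metrics": 50, "heartbeat": 50})
print(shares)  # {'ingest': 90, 'metrics': 5, 'heartbeat': 5}
```

A production framework would feed real per-thread profiling data into a policy like this and enforce the shares through the kernel or hypervisor scheduler.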

IT’s role as conductor of the cloud-app orchestra

In the past eight years of Global Knowledge’s annual IT Skills and Salary Reports, cloud computing has gone from being non-existent (no mentions at all) to being the area of greatest interest for IT pros. Initially, IT staffers were concerned that a shift to cloud infrastructure from in-house systems would lead to consolidation and layoffs. As most people in IT know, quite the opposite occurred.

The ascendance of the cloud has led to more opportunities for IT workers, and new roles for existing staffers with cloud expertise. Global Knowledge’s Ryan Day writes in an August 22, 2017, article that despite the shift in focus to training for cloud skills, IT departments are challenged to fill their open positions: 28 percent of the IT managers responding to the 2017 survey report difficulties in finding workers with cloud experience, up from only 6 percent in the 2016 survey.

Figure 12: In eight years of annual IT surveys, cloud computing went from zero mentions to being the highest-paying functional area, second among top interests, and third among the most difficult hiring areas. Source: Global Knowledge

One of the important new roles IT has taken on is to work with a range of cloud service providers: Some are the big names that provide the cloud infrastructure; increasingly, the services offer specialty cloud skills that fill important roles in the organization’s cloud strategy. Christine Cignoli writes in an August 23, 2017, article on DZone that managing relationships with multiple cloud providers requires a new set of skills.

In particular, overlapping service level agreements complicate the important task of ensuring that cost and performance guarantees are being met. Drastic changes in pricing structures and the transition from CapEx to OpEx spending mean IT managers need to hone their negotiation skills and vet all cloud agreements carefully.

‘Cloud management’ and ‘information management’ are becoming indistinguishable

For IT managers, the challenge is not joining the cloud, it’s leading it. IT serves as educator, mentor, and host (as in “emcee”) on their company’s journey to the cloud. The new roles give IT managers a chance to work more closely with their business unit customers, with business partners, and with other third parties. The best way to know what your customers need is to be working with them shoulder to shoulder.

Enabling the promise of new technology so it helps guide your company to achieving its goals: That’s been the job of the IT department for as long as there has been an IT department. In 10 years, cloud computing will have morphed into a form not many of us can imagine today. Yet the IT department will be front and center, giving workers and managers the tools they need to succeed.
