Visum — A Cloud Cost Optimization Platform. Part 1

Dima Statz
5 min read · Sep 28, 2022
Visum

Background

The worldwide infrastructure as a service (IaaS) market grew 41.4% in 2021 to total $90.9 billion, up from $64.3 billion in 2020. It is expected to reach as much as $121.62 billion in 2022.

Cloud adoption is still growing, and the vast majority of enterprise organizations and more than 50% of SMBs spend more than a million dollars a year on cloud computing.

Problem Statement

1. Waste

Companies struggle to use cloud resources efficiently. According to a Flexera survey of over 750 businesses, a third of cloud computing spend is wasted.

2. A Visible Waste

The whole idea of the cloud is elasticity: the cloud will provide as many resources as one requests. Mistakes in resource sizing, scaling in/out, and keeping idle resources result in overspending. This problem is well known, and a number of companies provide software for the financial management of cloud resources.

3. An Invisible Waste

It is ‘normal’ for Cloud Native Applications to be suboptimal from a performance point of view. Cloud Native Applications are designed for scalability and run on the cloud, where resources are virtually unlimited. When a Cloud Native Application reaches 100% CPU utilization in production, it is easier to scale out resources than to execute hard and time-consuming performance optimization activities. Performance optimization is done only when the cloud cost skyrockets.

4. Need or Desire

40 billion US dollars wasted yearly, energy efficiency, and reducing the carbon footprint are all good reasons to invest in Cloud Cost Optimization tools.

Visum

Visum is a Cloud Cost Optimization Platform that uses AI techniques to expose and handle the Invisible Waste of Cloud Native Applications. The first version of Visum takes a bottom-up approach and focuses on cost optimization of Apache Spark Applications running on AWS. However, Apache Spark Applications are just a special case of Cloud Native Applications, and Visum can later be extended to handle any Cloud Native Application.

Apache Spark

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance.
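
Since Spark is the subject of the rest of this series, here is a minimal, self-contained sketch of what programming that interface looks like: a local word count in Scala. The application name, input data, and local master setting are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; on a real cluster the master comes from the launcher
    val spark = SparkSession.builder()
      .appName("word-count-sketch")
      .master("local[*]")
      .getOrCreate()

    // A tiny in-memory dataset; the same code runs unchanged against distributed storage
    val lines = spark.sparkContext.parallelize(Seq(
      "spark distributes work across a cluster",
      "spark handles failures for you"))

    // Parallelism and fault tolerance are implicit: the driver only declares transformations
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach { case (word, n) => println(s"$word -> $n") }
    spark.stop()
  }
}
```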


Apache Spark Applications on Cloud

ESG research found that 43% of respondents consider the cloud their primary deployment environment for Apache Spark. This makes a lot of sense, because the cloud provides scalability, reliability, availability, and massive economies of scale. Another strong selling point of cloud deployment is the low barrier to entry in the form of managed services: each of the Big 3 cloud providers has its own offering for running Apache Spark as a managed service.

Apache Spark Applications and Cloud Costs

Running high-scale Apache Spark Applications on public clouds is expensive. For example, when running on AWS EMR, applications are billed roughly as execution time × number of cores × price per core per minute, plus the cost of storage and the managed service fee. Processing 1 TB of daily traffic can cost as much as 1M USD/year. In such a scenario, any problem or bad code change that adds 10% to the execution time will cost an additional 100K USD/year. So the incentive to keep costs under control is very high.
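
To make the arithmetic concrete, here is a small back-of-the-envelope sketch. All figures (cluster size, hours, per-core price, storage cost) are assumptions for illustration, not actual AWS prices.

```scala
object EmrCostSketch {
  def main(args: Array[String]): Unit = {
    // Assumed figures, not real AWS prices: a large job processing ~1 TB/day
    val coreHoursPerDay  = 2500 * 16.0   // 2,500 cores busy for ~16 hours
    val pricePerCoreHour = 0.0625        // EC2 + EMR fee per core-hour (assumed)
    val storagePerDay    = 150.0         // S3/EBS per day (assumed)

    val computePerDay = coreHoursPerDay * pricePerCoreHour      // ~2,500 USD
    val yearlyCost    = (computePerDay + storagePerDay) * 365   // ~0.97M USD

    // A bad change that adds 10% to execution time inflates only the compute part
    val extraYearly = computePerDay * 0.10 * 365                // ~91K USD

    println(f"Estimated yearly cost: $$${yearlyCost}%.0f")
    println(f"Yearly cost of +10%% runtime: $$${extraYearly}%.0f")
  }
}
```

The exact numbers matter less than the structure: at this scale the compute term dominates, so every percent of runtime translates almost directly into dollars.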

Apache Spark Applications and Costs Monitoring

Spark Applications use a dynamic allocation policy, meaning that during runtime an application may request resources when there is demand and give them back to the cluster when they are no longer used. Some applications go through such an allocate/release cycle several times during their runtime. This makes cost tracking non-trivial; Spark Listeners and the Spark UI can help with cost tracking and observability, and the following article shows how they can be used for that purpose.

Apache Spark Optimization
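
As a sketch of the kind of tracking a listener enables, the class below accumulates executor core-seconds across dynamic allocation cycles, which can then be multiplied by a per-core price. The class name and the pricing idea are mine; the listener callbacks themselves are standard Spark API.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}
import scala.collection.mutable

// Accumulates how many core-seconds the application has held, across allocate/release cycles
class CoreSecondsListener extends SparkListener {
  private case class Running(cores: Int, startMs: Long)
  private val running = mutable.Map[String, Running]()
  @volatile var coreSeconds: Double = 0.0

  override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit = synchronized {
    running(e.executorId) = Running(e.executorInfo.totalCores, e.time)
  }

  override def onExecutorRemoved(e: SparkListenerExecutorRemoved): Unit = synchronized {
    running.remove(e.executorId).foreach { r =>
      coreSeconds += r.cores * (e.time - r.startMs) / 1000.0
    }
  }
}
```

Such a listener is registered with spark.sparkContext.addSparkListener(new CoreSecondsListener()) or via the spark.extraListeners configuration.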

Costs Monitoring and Optimization

Performing cost optimization of Apache Spark is a hard task. There are a lot of things that can potentially go wrong: executors can fail, latency to external data sources can increase, the nature of the input data can change, cloud APIs can be used incorrectly, the JVM can misbehave, and many more. One should be a world-class expert in all three domains: the JVM, Apache Spark, and a specific cloud provider such as AWS, Azure, or GCP.

What Visum Solves

Essentially, there are three problems when dealing with Spark Application costs:

  1. Reactive way of work. Dev teams react to the past rather than anticipate the future. The cost optimization process starts only when the cloud cost skyrockets and a lot of money has already been wasted. In fact, ‘small’ problems like 10% of additional execution time (which can amount to 100K USD yearly and more) are never handled at all.
  2. The highest level of expertise is required. One should have deep and comprehensive knowledge of all three domains: Apache Spark, AWS APIs, and the JVM in order to optimize the cost of Apache Spark Applications. In addition, when working with PySpark it is necessary to master Python as well. It is rare for one person to have all these skills.
  3. Long time to fix. Even when detected, cost optimization issues are not prioritized for handling. Dev teams work on what is urgent, like designing and developing new features, handling product bugs, etc. Cost optimization tasks fall into the non-functional bucket; such tasks are hard to justify without an exact dollar figure for the wasted money. And even when prioritized, it takes time to find the problem, fix the code, and deploy to production.
Visum Data Flow

The whole idea of Visum is to find ‘bad patterns’ automatically. To do so, Visum intercepts events from the Spark scheduler, the JVM, and AWS, and performs data stream analytics in real time. Visum performs all the steps usually performed by analytics pipelines: ingestion, normalization, enrichment, and pattern recognition.
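
Visum's internals are not shown here, but a heavily simplified sketch of such a pipeline could look like the following. The event shapes, field names, assumed price, and the toy detection rule are illustrative assumptions, not Visum's actual schema or rules.

```scala
// Illustrative event and finding types; real events come from the Spark scheduler, the JVM and AWS
final case class RawEvent(source: String, name: String, durationMs: Long, tsMs: Long)
final case class NormalizedEvent(source: String, name: String, durationMs: Long, tsMs: Long)
final case class EnrichedEvent(event: NormalizedEvent, estimatedCostUsd: Double)
final case class Finding(rule: String, detail: String)

object PipelineSketch {
  // Normalization: map heterogeneous sources onto one schema (trivial here)
  def normalize(e: RawEvent): NormalizedEvent =
    NormalizedEvent(e.source.toLowerCase, e.name.trim, e.durationMs, e.tsMs)

  // Enrichment: attach an estimated cost (assumed 100 cores at 0.001 USD per core-minute)
  def enrich(e: NormalizedEvent): EnrichedEvent =
    EnrichedEvent(e, 100 * (e.durationMs / 60000.0) * 0.001)

  // Pattern recognition: a toy rule flagging work that runs longer than 10 minutes
  def detect(events: Seq[EnrichedEvent]): Seq[Finding] =
    events.collect {
      case ev if ev.event.durationMs > 10 * 60 * 1000 =>
        Finding("long-running-stage",
          f"${ev.event.name} ran ${ev.event.durationMs / 60000} min, ~${ev.estimatedCostUsd}%.2f USD")
    }

  def main(args: Array[String]): Unit = {
    // Ingestion is faked with an in-memory Seq; a real system would read a continuous stream
    val raw = Seq(RawEvent("spark-scheduler", "stage-42", durationMs = 45 * 60 * 1000, tsMs = 0L))
    detect(raw.map(normalize).map(enrich)).foreach(println)
  }
}
```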

In the final step, Visum generates two reports. The first is a detailed report of detected issues, with the estimated wasted cost of each issue, a reference to the source code where the issue occurred, and a link to the knowledge base entry that explains the problem. The second is a benchmark report.

Waste Report
Benchmark Report

Summary

In the next chapter, I will show how Visum automatically recognizes Apache Spark performance issues such as Data Skew, Multiple Evaluation, Expensive Shuffle, and more.
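
As a small preview of one of those patterns, the snippet below shows what Multiple Evaluation looks like in user code: the same derived dataset is computed twice because nothing instructs Spark to keep the intermediate result. The dataset and sizes are toy placeholders.

```scala
import org.apache.spark.sql.SparkSession

object MultipleEvaluationPreview {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("preview").master("local[*]").getOrCreate()

    // An "expensive" derived dataset (a toy stand-in for a heavy transformation)
    val enriched = spark.range(0, 1000000L).selectExpr("id", "id % 7 as bucket")

    // Multiple evaluation: each action below recomputes `enriched` from scratch
    val total   = enriched.count()
    val buckets = enriched.groupBy("bucket").count().collect()

    // Calling enriched.cache() before the two actions would avoid the recomputation
    println(s"rows=$total, distinct buckets=${buckets.length}")
    spark.stop()
  }
}
```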
