Cost efficiency and portability are the main reasons to migrate Apache Spark workloads from managed services like AWS EMR, Azure Databricks, or HDInsight to Kubernetes. You can learn more about the migration process from AWS EMR to K8s in the following article. However, there are also potential pitfalls in leaving managed services. Probably the biggest one is losing monitoring and alerting capabilities. For example, AWS EMR has a really rich built-in monitoring toolbox in the form of CloudWatch, Ganglia, CloudTrail, and the YARN history server. …


Amazon EC2 provides a broad portfolio of compute instances, including many that are powered by the latest-generation Intel and AMD processors. AWS Graviton2 processors add even more choice. AWS Graviton2 processors are custom-built by AWS using 64-bit Arm Neoverse cores to enable the best price-performance for workloads running on Amazon EC2.

On Oct 16, 2020, Amazon announced that EMR now supports Amazon EC2 M6g and provides up to 35% lower cost and up to 15% improved performance for Apache Spark workloads on Graviton2-based instances versus previous generation instances.


Obviously, this is great news for all EMR users, especially for those…


One of the 5 pillars of the Well-Architected Framework is Cost Optimization. The Cost Optimization pillar focuses on avoiding unnecessary costs, selecting the most appropriate resource types, analyzing spend over time, and scaling in/out to meet business needs without overspending. On the one hand, it is pretty clear why any organization must dedicate the necessary time and resources to continuously perform cost optimization tasks; on the other hand, it is really hard to drive such a change, especially in organizations that are new to cloud technologies.

One of the best ways to start building a cost-aware culture is to do it…


ESG research found that 43% of respondents consider the cloud their primary deployment for Apache Spark. And it makes a lot of sense, because the cloud provides scalability, reliability, availability, and massive economies of scale. Another strong selling point of cloud deployment is a low barrier to entry in the form of managed services. Each one of the ‘Big Three’ cloud providers comes with its own offering to run Apache Spark as a managed service. You have probably heard of AWS Elastic MapReduce, Azure Databricks, Azure HDInsight, and Google Dataproc.

I will focus on AWS Elastic Map Reduce since we are…


COVID-19 made a huge impact on the world. Every single aspect of our lives went through a massive change, and just like many other software development teams in the world, our team went fully remote. I have no idea when our office will reopen; offices may never be much of a thing again, so it is better to be prepared to work remotely for a long time. There are a bunch of good articles on the web about transitioning to fully remote work; my favorite one is “The Remote Manifesto” by GitLab. In addition, if…


Monitoring and alerting is a mandatory part of any software system running in a production environment. To keep software systems healthy, to optimize performance and resource utilization, you need a unified operational view, real-time granular data, and historical reference. Here, I will show how a monitoring system for distributed Jetty servers running on K8s can be built by using Prometheus and Grafana.
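As a taste of the setup described above, here is a minimal Prometheus scrape-config sketch for discovering Jetty pods on Kubernetes. The pod label (`app: jetty`) and the assumption that each JVM exposes metrics (for example via the Prometheus JMX exporter) are illustrative, not taken from the article:

```yaml
# prometheus.yml — minimal sketch; label names are assumptions
scrape_configs:
  - job_name: 'jetty'
    kubernetes_sd_configs:
      - role: pod            # discover every pod in the cluster
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: jetty         # keep only pods labeled app=jetty
        action: keep
```

With pod-based service discovery like this, new Jetty replicas are scraped automatically as the deployment scales, which is the main reason to prefer it over a static target list on K8s.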


Jetty is a free and open-source Web server and servlet container. Even though Jetty’s market share is nowhere near Tomcat’s, it’s still widely used in the industry. If you are using Tomcat, here you can find the step…

Services for building Big Data pipelines on AWS


This article is all about moving data into Big Data pipelines running on AWS. Most data pipelines have 5 steps in common: collection -> storage -> processing -> analysis -> visualization, and AWS has a very solid foundation for building all these steps. For example, when it comes to the data collection step, you can use the following services:

  1. Real-time pipeline: Kinesis Data Streams, IoT, Simple Queue Service, and Managed Streaming for Apache Kafka.
  2. Batch pipeline: Snowball, Database Migration Service
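For the real-time path, Kinesis Data Streams is often the first building block. Its `PutRecords` API accepts at most 500 records per call, so a producer has to batch its events; a minimal sketch of that batching logic (the stream name and event shape are hypothetical):

```python
# Sketch: batching events for Kinesis PutRecords.
# Real API limit: PutRecords accepts up to 500 records per request.
from itertools import islice

def chunk(iterable, size=500):
    """Yield lists of at most `size` items each."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

# With boto3 (assuming credentials and an existing stream), each batch
# would then be sent roughly like:
#   kinesis.put_records(StreamName="events", Records=[
#       {"Data": e, "PartitionKey": str(e)} for e in batch])

batches = list(chunk(range(1200)))
print([len(b) for b in batches])  # -> [500, 500, 200]
```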

There are some cases that aren’t covered though. Consider a batch-processing data pipeline with a lot of different data sources that are…

Even Distribution vs Distribution With Skew


One of the well-known problems in parallel computational systems is data skewness. Usually, in Apache Spark, data skewness is caused by transformations that change data partitioning, like join, groupBy, and orderBy. For example, joining on a key that is not evenly distributed across the cluster causes some partitions to be very large, which prevents Spark from processing the data in parallel. Since this is a well-known problem, there are a bunch of available solutions for it. In this article, I will share my experience of handling data skewness in Apache Spark.
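One classic remedy for skewed joins is key salting: append a random suffix to the hot key so its rows spread over several partitions (the other join side is then replicated once per suffix). A plain-Python sketch of the idea — the names are illustrative; in Spark you would add the salt column with `rand()`:

```python
# Plain-Python sketch of key salting for skewed joins/groupBys.
import random

NUM_SALTS = 8

def salt_key(key: str) -> str:
    """Spread a hot key over NUM_SALTS sub-keys, e.g. 'US' -> 'US#3'."""
    return f"{key}#{random.randrange(NUM_SALTS)}"

random.seed(42)
hot_rows = ["US"] * 1000                 # one heavily skewed key
salted = {salt_key(k) for k in hot_rows}
print(len(salted))                       # -> 8 distinct sub-keys instead of 1
```

The trade-off is extra shuffle volume from replicating the small side NUM_SALTS times, so the salt count should stay close to the cluster's parallelism.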


In this specific example, we will talk about an…

“Define user behavior with Python code, and swarm your system with millions of simultaneous users” —


Vertica is a very powerful analytic database designed to handle massive amounts of data. When configured right, Vertica enables great query performance even in very intensive scenarios. Some Vertica users report that their Vertica cluster handles thousands of concurrent users running sub-second queries. Designing and configuring a database with such performance requires a lot of knowledge and experience and is out of the scope of this article. But once the cluster is designed, configured, and loaded with data, another…


Building a data pipeline that handles 1,000,000 or more events per second is not a trivial task. To handle such traffic, all data pipeline components must be designed and implemented properly. Fortunately, not all data pipeline components have to be built from scratch; the open-source community offers a bunch of very solid solutions. In this article, I will show how a data ingestion component, one of the most volume-sensitive parts of a data pipeline, can be built without writing a single line of code, just by using free and open-source building blocks.


A modern data pipeline should contain all…

Dima Statz
