Imagine that you wrote a Spark job to process a huge amount of data and it took 2 days to finish. It happens. Actually, it happens regularly. To tune these jobs, engineers need information. It can be obtained from Spark events (if you run something on a cluster in Spark...
Sun Oct 13 2019 |
2 minute read
There are many different tools in the world, each of which solves a range of problems. Many of them are judged by how well and how correctly they solve a given problem, but there are also tools that you simply like and want to use. They are well designed and fit well in your hand; you do not...
Sun Oct 06 2019 |
9 minute read
When we work with data, we usually do it in a system that belongs to one of two types. The first of them is schema-on-write. Schema-on-write: Many of you have probably already worked with relational databases, and you understand that the first step in working with a relational...
Sun Sep 29 2019 |
5 minute read
There are a lot of engineers who have never been involved in statistics or data science. So when they build data science pipelines or rewrite code produced by data scientists into adequate, easily maintained code, many nuances and misunderstandings arise on the engineering side. For these Data/ML...
Fri Sep 20 2019 |
7 minute read
Apache Spark supports many different data formats, such as the ubiquitous CSV format and web-friendly JSON format. Common formats used primarily for big data analytical purposes are Apache Parquet and Apache Avro. In this post, we’re going to cover the properties of these 4 formats — CSV, JSON,...
Sun Aug 18 2019 |
13 minute read
To support emoji on Linux, we will use the Noto Color Emoji font. The script is simple: wget https://noto-website.storage.googleapis.com/pkgs/NotoColorEmoji-unhinted.zip sudo mkdir -p /usr/local/share/fonts/truetype sudo unzip NotoColorEmoji-unhinted.zip -d...
Sat Jul 20 2019 |
1 minute read