Blog | luminousmen


The 5-minute guide to using bucketing in Pyspark

There are many different tools in the world, each of which solves a range of problems. Many of them are judged by how well and correctly they solve this or that problem, but there are tools that you simply like and want to use. They are properly designed and fit well in your hand, you do not...

Spark tips. Don't collect data on driver

There are many different tools in the world, each of which solves a range of problems. Many of them are judged by how well and correctly they solve this or that problem, but there are tools that you simply like and want to use. They are properly designed and fit well in your hand, you do not...

How to not leap in time using Python

If you want to display the time to a user of your application, you query the time of day. However, if your application needs to measure elapsed time, you need a timer that will give the right answer even if the user changes the time on the system clock. The system clock which tells the time of...

Spark History Server and monitoring jobs performance

Imagine that you wrote a Spark job to process a huge amount of data, and it took 2 days to finish. It happens. Actually, it happens regularly. To tune these jobs, engineers need information. It can be obtained from Spark events (if you run something on a cluster in Spark...

Spark tips. DataFrame API

There are many different tools in the world, each of which solves a range of problems. Many of them are judged by how well and correctly they solve this or that problem, but there are tools that you simply like and want to use. They are properly designed and fit well in your hand, you do not...

Schema-on-Read vs Schema-on-Write

When we talk about working with data, we are usually doing it in a system that belongs to one of two types. The first of them is schema-on-write. Schema-on-write: Probably many of you have already worked with relational databases. And you understand that the first step to working with a relational...