Analytical Challenges in Big Data

A friend asked me an interesting question about which skills are worth learning for Data Management specialists and how to build a growth roadmap.

The question made me think, because I didn't have a clear picture in my head. These are just my thoughts on the topic, and for the most part I'm speculating about the current state and the future of Data Management.

Analytics and BI is an approach to making data-driven decisions that gives businesses the information they need. At this level of the technology stack, you can run queries to answer the questions a business asks, slice data, build dashboards, and create clear visualizations.
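
To make "run queries and slice data" a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The sales table, its columns, and all the numbers are made up for illustration; a real BI setup would query a warehouse rather than an in-memory database.

```python
import sqlite3

# Hypothetical sales records: (region, product, amount). All values are invented.
ROWS = [
    ("EU", "basic", 100.0),
    ("EU", "pro", 250.0),
    ("US", "basic", 120.0),
    ("US", "pro", 300.0),
    ("US", "pro", 280.0),
]


def revenue_by(dimension):
    """Return total revenue sliced by one dimension ('region' or 'product')."""
    # Whitelist the column name, since it is interpolated into the SQL text.
    if dimension not in ("region", "product"):
        raise ValueError("unknown dimension: " + dimension)
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", ROWS)
        query = (
            f"SELECT {dimension}, SUM(amount) FROM sales "
            f"GROUP BY {dimension} ORDER BY {dimension}"
        )
        return dict(conn.execute(query).fetchall())
    finally:
        conn.close()


by_region = revenue_by("region")    # the same data, sliced by region
by_product = revenue_by("product")  # and sliced by product
print(by_region)
print(by_product)
```

The same GROUP BY pattern is roughly what a dashboard does behind the scenes each time you change the dimension you slice by.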

With more data, you can make more accurate forecasts, build more reliable solutions, and create new ones, climbing higher and higher on the ML ladder.

Machine Learning

Machine learning is a specific set of analytics methods that lets you create models that analyze large and complex data and make predictions or decisions without being explicitly programmed to do so. More and more organizations use ML methods to complement their daily operational analytics and normal business operations.

In the past, ML was limited to some extent by the fact that data scientists could not evaluate and test a solution before a team of data engineers deployed it into production. In fact, most organizations had a traditional BI/analytics team, followed by a separate data science team and a data engineering team. These skill sets have now begun to overlap, and those teams work together more thoughtfully as Big Data slowly crawls towards analytics and building knowledge on top of Big Data, because Big Data is too big to handle without the help of ML.

Therefore, I think understanding at least the basic concepts of ML is required. Of course, special attention should be paid to the things it relies on: statistics, ML approaches, optimization methods, the bias/variance trade-off, the different evaluation metrics (this is actually important), and so on. In applied Machine Learning you basically need to understand why everything works. The formulas are not that important, but those who do not understand the language behind the models usually make very stupid mistakes.
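
As one illustration of why the metrics matter, here is a small sketch in plain Python. The labels are invented (95 negatives, 5 positives), and the "model" is a hypothetical one that always predicts the negative class. It looks good on accuracy and is useless on recall.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    # Defined as 0.0 here when nothing was predicted positive.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    # Defined as 0.0 here when there are no positive labels.
    return tp / (tp + fn) if (tp + fn) else 0.0


# Invented, heavily imbalanced labels: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5
# A hypothetical "model" that always predicts the majority class.
always_negative = [0] * 100

print("accuracy:", accuracy(y_true, always_negative))
print("precision:", precision(y_true, always_negative))
print("recall:", recall(y_true, always_negative))
```

Someone who only looks at accuracy would ship this model. Someone who knows what recall means would not.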

There's a lot more to say, but I'll leave it for another time. There are many areas inside ML (NLP, CV, recommender systems, knowledge representation, etc.), but usually, once you understand at least the basics, you already understand what you do not understand, so you can go as deep as you want.

If you want to be a machine learning engineer, make sure you know Python. It's the lingua franca of Machine Learning. Then it's worth learning the different frameworks for working with data, such as NumPy, Pandas, Dask, and the already mentioned Apache Spark. And of course, the most popular ML libraries: scikit-learn and XGBoost.

I think everybody understands that the really important directions in ML have, in one way or another, long been related to Deep Learning. Classic algorithms are not going anywhere, of course. In most cases they are enough to make a good model, but the future lies in neural nets. The magic of deep learning is that it gets better with more data. It's also worth mentioning terms like transfer learning, the 1cycle policy, CUDA, and GPU optimizations here.

The most popular libraries here are PyTorch with fast.ai and TensorFlow with Keras (I think we will bury it in the near future, but anyway).

Distributed ML

Another thing worth mentioning is Distributed ML. As I said, Big Data is slowly heading towards more sophisticated analytics. Large datasets stored in a central repository place huge demands on processing and computation, so distributed ML is the right direction, although it has a lot of problems.
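
To give a feel for the idea, here is a toy sketch of data-parallel training in plain Python. Each "worker" computes a gradient on its own shard of the data, and the gradients are averaged before the shared weight is updated. The averaging stands in for the all-reduce step that real frameworks perform over a network. The one-parameter model, the data, and the learning rate are all assumptions made for illustration.

```python
def gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def distributed_step(w, shards, lr):
    """One synchronous step: local gradients, then an average, then an update."""
    # In a real system each of these would run on a separate machine.
    local_gradients = [gradient(w, shard) for shard in shards]
    # Averaging here plays the role of an all-reduce across workers.
    averaged = sum(local_gradients) / len(local_gradients)
    return w - lr * averaged


def train(shards, steps=200, lr=0.01):
    w = 0.0
    for _ in range(steps):
        w = distributed_step(w, shards, lr)
    return w


# Invented data that follows y = 3x exactly, split into two equal shards.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[0:4], data[4:8]]

w = train(shards)
print("learned weight:", w)
```

With equally sized shards, the averaged gradient equals the gradient over the full dataset, so the result matches single-machine training. The hard parts in practice (communication cost, stragglers, failures) are exactly what this sketch leaves out.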

I'm personally interested in this approach, but it really doesn't matter to anyone except huge corporations. Model accuracy is incredibly important to them, and it can only be achieved by building huge models with millions of parameters on a huge pile of data. For everyone else, as I said, classical algorithms on subsets or pre-aggregated data are quite good for practical applications.

I will mention a couple of tools that work: Spark MLlib, H2O, Horovod, MXNet, DL4J. Perhaps Apache Ignite.

Real-time analytics

While organizations generally value real-time data management, not all companies go for real-time analysis of large data. Reasons vary: lack of experience or insufficient funds, fear of related problems, or general reluctance of management. However, those companies that implement real-time analytics will gain a competitive advantage.
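
The core building block of most real-time analytics is a windowed aggregate over a stream of events. Below is a minimal single-process sketch in plain Python, not the API of any streaming framework. The class name is hypothetical, and it assumes events arrive in timestamp order, which real streams do not guarantee.

```python
from collections import deque


class SlidingWindowAverage:
    """Running average of values seen within the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Record one event and return the average over the current window."""
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total / len(self.events)


window = SlidingWindowAverage(window_seconds=10)
for ts, value in [(0, 10.0), (5, 20.0), (12, 30.0), (30, 40.0)]:
    print(ts, window.add(ts, value))
```

Keeping a running total means each event is handled in amortized constant time, which is the property that lets this kind of aggregate keep up with a stream.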

The tools here are Apache Spark Streaming, Apache Ignite Streaming, Apache Flink, and Amazon Kinesis.

Automation in Data Science

AutoML was invented to automate, at least in part, data preprocessing, feature engineering, model selection and configuration, and the evaluation of results. It can take over those tasks and suggest where to continue research.

It sounds great, of course, but how effective is it? The answer depends on how you use it. It's about understanding what people are good at and what machines are good at. People are good at connecting existing data to the real world: they understand the business area and what specific data means. Machines are good at calculating statistics, storing and updating state, and running repetitive processes. Tasks like exploratory data analysis, data preprocessing, hyperparameter tuning, model selection, and putting models into production can be automated to some extent with an automated machine learning framework, but good feature engineering and drawing actionable insights still need a human data scientist who understands the business context. By separating these activities, we can easily benefit from AutoML now, and I think that in the future AutoML will replace most of the work of a data scientist.
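
The repetitive part that machines are good at can be sketched in a few lines: fit every candidate model, score each on held-out data, and keep the best. This is a toy illustration in plain Python, not how any particular AutoML framework works. The two candidate models, the data, and the split are assumptions.

```python
def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """One-feature least-squares line, solved in closed form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mse(model, xs, ys):
    """Mean squared error of a fitted model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_model(candidates, train, valid):
    """Fit each candidate on train, score on valid, return the best name."""
    scores = {}
    for name, fit in candidates.items():
        model = fit(*train)
        scores[name] = mse(model, *valid)
    best = min(scores, key=scores.get)
    return best, scores


# Invented data that follows y = 2x + 1, with a simple holdout split.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
train = (xs[:7], ys[:7])
valid = (xs[7:], ys[7:])

best, scores = select_model({"mean": fit_mean, "linear": fit_linear}, train, valid)
print("best model:", best)
print("validation MSE per model:", scores)
```

Note what the loop cannot do: it has no idea whether x is the right feature to begin with. That part still belongs to the person who understands the business.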

What's more, I think it's noticeable that ML platform solutions (e.g. Amazon SageMaker, Microsoft Azure ML, Google Cloud ML, etc.) are evolving rapidly, and as ML adoption grows, many enterprises are quickly moving to ready-to-use ML platforms to accelerate time to market, reduce operating costs, and improve success rates (the number of ML models deployed and commissioned).

Visualization & BI

Because of the type of information processed in Big Data systems, recognizing trends or changes in data over time is often more important than the values themselves. Data visualization is one of the most useful ways to make sense of large numbers of data points. It helps tell stories by presenting the data in an easy-to-understand form, highlighting trends and deviations.

BI converts unprocessed information from various sources into convenient and understandable analytics. BI systems can be applied in any industry or area of activity, both at the level of the company as a whole and for divisions or individual products.

The most popular Visualization & BI tools, it seems to me, are Tableau, Looker, Microsoft Power BI, and Qlik, in addition to the rest of the technology stack described above.

Tableau is a powerful tabular BI and data visualization tool that connects to data and allows you to perform detailed, comprehensive analysis, as well as draw charts and dashboards.

Looker is a cloud-based BI platform that allows you to query and analyze large amounts of data using SQL-defined metrics and then configure visualizations that tell a story about your data.

Another visualization technology commonly used for interactive work with data is the "notebook." Notebooks allow for interactive research and data visualization in a format that facilitates sharing, presentation, and collaboration. Popular examples of this type of interface are Jupyter Notebook, Apache Zeppelin, and Polynote.