A friend asked me an interesting question: which skills are worth learning for Data Management specialists, and how do you build a growth roadmap?
The question made me think, because I didn't have a clear picture in my head. What follows are just my thoughts on the topic; for the most part, I'm speculating about the current state and the future of Data Management.
- Data Engineering skills
- Data Challenges in Big Data
- Management Challenges in Big Data
- Analytical Challenges in Big Data
- Operational Challenges in Big Data
Management challenges are another area of knowledge. It sits on a slightly different plane but relates directly to the data: management challenges cover privacy, security, governance, and data/metadata management.
Search and information retrieval
An information retrieval system is a set of algorithms that find relevant data or documents on user demand.
To search large amounts of data effectively, a simple scan is not advisable, and this is where various tools and solutions come in. One of the tools I see most often is Elasticsearch. It is used for Internet search, log analysis, and analysis of large data sets. Elasticsearch is popular because it is easy to install, scales to hundreds of nodes without any additional software, and is easy to work with thanks to its built-in REST API.
Other well-known tools are Solr, Sphinx, and Lucene (the library that both Elasticsearch and Solr are built on).
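To make the difference between a simple scan and an index more concrete, here is a minimal sketch of an inverted index, the basic data structure behind this kind of search. It is only an illustration: the function names and the sample log lines are made up for this example, and real engines add ranking, text analysis, and distribution on top of the idea.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Start from the first term's postings and intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents (log lines).
docs = {
    1: "Disk failure detected on node 7",
    2: "User login failure from unknown host",
    3: "Node 7 rejoined the cluster",
}
index = build_index(docs)

print(sorted(search(index, "failure")))  # [1, 2]
print(sorted(search(index, "node 7")))   # [1, 3]
```

The index is built once, and each query then only touches the entries for its own terms instead of rereading every document.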
Data Governance is an umbrella term for saying "I want to keep control of my data". This is probably one of the important areas of Big Data that is still undervalued and, in my opinion, lacks good solutions. The goal of data governance is to establish methods, responsibilities, and processes to standardize, integrate, secure, and store data. Without effective data governance, data inconsistencies between an organization's systems will not be eliminated. This can complicate data integration and create data integrity issues that affect the accuracy of business intelligence, corporate reporting, and analytics applications.
I'm certainly not an expert in this field, but the tools I've seen here are Informatica, Talend, and Semarchy.
Constantly growing data volumes make it harder to protect data against intrusions, leaks, and cyberattacks, because the level of data protection does not keep pace with the growth of data, vendors, and people. Comprehensive, end-to-end protection involves not only encrypting data throughout its lifecycle, at rest and in transit, but also securing it from the very beginning of the project. As you can see, this affects every aspect discussed in this article and, like everything in information security, it is difficult to do right.
Privacy laws such as GDPR, CCPA, and LGPD carry serious consequences for non-compliance. Businesses must take data confidentiality into account, and having specialists in these areas is becoming a necessity.
Among the tools worth paying attention to are Kerberos, Apache Ranger, and Apache Accumulo.
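As a small illustration of taking confidentiality into account at the data level, the sketch below pseudonymizes a personal identifier with a keyed hash (HMAC-SHA256) before the record goes to analytics. This is one common technique, not a compliance recipe: the record fields and the key are hypothetical, and pseudonymized data can still count as personal data under laws like GDPR.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Return a keyed SHA-256 digest of the value as a hex string."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder key for the example; in practice it would come from a secrets store.
SECRET_KEY = b"replace-with-a-real-secret"

# Hypothetical record with one directly identifying field.
record = {"email": "alice@example.com", "country": "DE", "purchases": 3}

# Replace the identifier, keep the analytical fields as they are.
safe_record = {**record, "email": pseudonymize(record["email"], SECRET_KEY)}

print(safe_record["country"], safe_record["purchases"])
```

The same input and key always produce the same token, so records can still be joined and counted per user, while the original email cannot be read back from the token without the key.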
Typically, a company has a lot of data in different forms, storage systems, and formats, with different levels of access. To find the data you need, you either have to know exactly where it is or know where to start looking (if there is such a place). This is where the so-called data directory, or Data Catalog, comes into play.
Managing corporate data sources is an essential process, but it relies on knowledge held by various small groups within the company. Collecting and maintaining metadata about all the data stored inside the organization is not easy: people come and go, and data is removed and added. Hence, building a data catalog is an important but complex task.
Among the well-known tools here are Apache Atlas and solutions from cloud providers.
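To show what a data catalog boils down to, here is a toy in-memory sketch: a registry of metadata entries that can be searched by keyword. It does not reflect how Apache Atlas or any cloud catalog is implemented; the class names, fields, and dataset locations are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Metadata about one dataset; the data itself stays where it is."""
    name: str
    location: str
    owner: str
    description: str = ""
    tags: set = field(default_factory=set)


class DataCatalog:
    """Toy in-memory catalog: register, remove, and find dataset entries."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Registering the same name again overwrites the old metadata.
        self._entries[entry.name] = entry

    def remove(self, name):
        self._entries.pop(name, None)

    def find(self, keyword):
        """Return entries whose name, description, or tags match the keyword."""
        keyword = keyword.lower()
        return [
            e for e in self._entries.values()
            if keyword in e.name.lower()
            or keyword in e.description.lower()
            or keyword in {t.lower() for t in e.tags}
        ]


catalog = DataCatalog()
catalog.register(DatasetEntry(
    name="sales.orders",
    location="s3://example-bucket/sales/orders/",  # hypothetical location
    owner="sales-team",
    description="One row per customer order",
    tags={"sales", "pii"},
))
catalog.register(DatasetEntry(
    name="web.clickstream",
    location="hdfs:///data/web/clickstream/",  # hypothetical location
    owner="web-team",
    description="Raw page view events",
    tags={"web"},
))

print([e.name for e in catalog.find("pii")])  # ['sales.orders']
```

The hard part in practice is not this structure but keeping the entries accurate as people and datasets change.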
Check out my book on a roadmap to the data engineering field.