Big Data — Big Privacy Hole

14 min read Last updated Jul 25 2024

#privacy #security #whining #data-engineering

I'm sorry, but my blog won't be complete unless I write something about privacy. This post was supposed to be about personal privacy, but under the influence of alcohol it slipped into big data, so do not judge, and read what follows as an article about personal privacy.

Arguing that you don't care about the right to privacy because you have nothing to hide is no different than saying you don't care about free speech because you have nothing to say.
― Edward Snowden

You have something to hide

"Who cares about me?"

"I have nothing to hide"

"I don't have anything important, why do I need privacy?"

The argument "I have nothing to hide" is ridiculous. Worse, it kills a person's motivation and makes any further effort seem pointless. People usually say it because they are unaware of their own assets: they underestimate the importance of the information they hold or process. Privacy is not just about having something to hide; it is also about keeping some information available only to yourself. Some information simply shouldn't concern other people.

It's not only about hiding but also about protecting. You defend your assets, and that is natural and obvious. Among digital assets, passwords and PIN codes are the most obvious examples. You don't tell anyone the password to your email account. You are unlikely to be delighted if someone steals your Facebook account and starts spreading offensive messages, viruses, or spam. You are not hiding facts here: yes, you have a bank account, an email, and a Facebook account. It is not a secret. You are protecting them from unauthorized, arbitrary access and from your data being used against your interests.

We don't make it a secret that we own a car, but we use a seatbelt to protect our health and life. These are undeniable physical assets. The same goes for online privacy: you would like to tell your friends where to find you on Sunday night, but you don't want a maniac with skills to know about it. If he did, you would probably reschedule your plans.

People who "have nothing to hide" easily spread their careless approach to others. They do not stop to think about how complex and interconnected today's world is. Remember that we are in close contact with others who may have their own assets and rules. You don't want to expose other people, your loved ones or your friends, especially if they are vulnerable and belong to a "risk group". When it comes to personal security, people often ask something like "Why would a hacker hack me?" without realizing that the hacker may not be interested in them at all, but in someone they are connected to.

As people move into the digital era, they forget that crime is moving there too. People used to be robbed on the streets; now the robbers work on the Internet and use the very information that people voluntarily made public. Those who say they have nothing to hide and casually publish their photos or phone numbers online expose themselves to very high risk. Usually, when people think about their information on the Internet, they think about their IP address and search history. In reality, much more can be recorded: how fast you type, how you move the mouse, what routes you take to work, and, through the camera, even where you look on the screen. And this is just the tip of the iceberg.

I am surprised by people's inertia. There are so many fans of conspiracy theories in the world, yet the average inhabitant of the planet finds it easier to live with the idea of a universal conspiracy to manipulate the consciousness of mankind via microchipping than to realize that such manipulation has already existed for years. Information is already being recorded by everyone, and behavior is being manipulated on a global scale. But it's not comfortable for people to think about it.

Companies

Cybercriminals are actually the smaller problem: they're not really interested in everyone. The most dangerous criminals are corporations and marketing experts, because they want everyone's information.

The new business model for many companies is to provide free content or services in exchange for a person's data. Many of us accept this: we agree to long and confusing terms and conditions, and in fact we do not object to some information being collected in exchange for a free service. With increasing digitalization, we unconsciously give away a huge amount of information about ourselves every day, receiving much less in return than we thought.

We are also losing sight of one small detail. Since our data is then stored in proprietary repositories, out of our sight, we lose direct control over that data and over the choice of when and with whom to share it. Moreover, we often have no way to revoke access for companies we would rather not share with, especially third parties; it is all or nothing. Are you sure that all the services you are using are concerned about security and will never use your data against you? Well, I'm not sure.

And I haven't said anything about the government yet. In the post-Cambridge Analytica, post-Snowden world, I don't know what I can add there.

Big Data

The whole life of any person nowadays is tied to devices, to social networks, to virtual life. And with the growing number of IoT devices everywhere, our role has changed: a person has become a sensor in a system that belongs to someone else. Each of us has a very big digital footprint, much bigger than what we are shown or what we assume. All our data is created and collected by applications and stored not on our phones but in the cloud, on remote servers around the world, owned not by you and me but by corporations that collect this data for their own purposes. Collecting the data is not the worst problem for us as users; combining big data with intelligent analysis technologies is what opens up the really scary side of big data.

Big data analysis systems work with huge amounts of information. The more unique (read "more private") this data is, the more interesting and intimidating the conclusions that can be drawn from it using modern methods. Imagine that any company you are associated with has four types of information about you. The first is Personally Identifiable Information (PII): information that allows anyone to directly identify and contact you, such as your name, social security number, email address, phone number, and so on. The second type is Quasi-Identifiers (QI): information such as ZIP code, age, and gender that on its own is too general to single you out but becomes identifying in combination. A widely cited estimate is that ZIP code, gender, and date of birth together uniquely identify 87% of the US population. The third is Sensitive information, which you would like to tell only your wife and doctor, or better only your doctor: preferences, salary, illnesses, and so on. And the last type, let's call it "everything else", is anything that does not fall under the other definitions.
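To make the QI point concrete, here is a minimal sketch that counts how many records in a dataset are singled out by their combination of quasi-identifiers. The records, the column names, and the helper functions are all made up for illustration, and it uses age rather than full date of birth to keep the toy data short.

```python
from collections import Counter

# Hypothetical toy records: PII columns already removed, quasi-identifiers kept.
records = [
    {"zip": "02138", "age": 34, "gender": "F", "diagnosis": "flu"},
    {"zip": "02138", "age": 34, "gender": "F", "diagnosis": "asthma"},
    {"zip": "02139", "age": 51, "gender": "M", "diagnosis": "diabetes"},
    {"zip": "02139", "age": 29, "gender": "F", "diagnosis": "flu"},
    {"zip": "02140", "age": 29, "gender": "M", "diagnosis": "cancer"},
]

QUASI_IDENTIFIERS = ("zip", "age", "gender")


def group_sizes(rows, keys):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def unique_fraction(rows, keys):
    """Fraction of rows whose quasi-identifier combination appears exactly once."""
    sizes = group_sizes(rows, keys)
    unique_rows = sum(1 for row in rows if sizes[tuple(row[k] for k in keys)] == 1)
    return unique_rows / len(rows)


def k_anonymity(rows, keys):
    """Size of the smallest group: the k for which the table is k-anonymous."""
    return min(group_sizes(rows, keys).values())


# Each column alone looks harmless, but the combination singles people out.
print(unique_fraction(records, ("gender",)))         # 0.0
print(unique_fraction(records, QUASI_IDENTIFIERS))   # 0.6
print(k_anonymity(records, QUASI_IDENTIFIERS))       # 1
```

Gender alone singles out nobody in this toy table, but ZIP code, age, and gender together make three of the five records unique.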

Despite all the GDPR, CCPA, and LGPD regulations that have appeared recently, personal privacy is still not protected. Obviously, simply removing the PII columns from a dataset is not enough to protect privacy, even though protecting it is the whole point of these acronyms. If even basic demographic data (which qualifies as QI) remains in the dataset, it can be combined with other publicly available data sources to identify people with great accuracy. The use of big data technologies and analytics erodes the legal and technical boundaries that people rely on to maintain privacy.
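A rough sketch of how such a combination works in practice is a plain join on the quasi-identifiers. Everything below is hypothetical: the "released" table stands in for a dataset with names stripped, the "public" table stands in for something like a voter roll, and `link` is just an illustrative helper, not anyone's real tool.

```python
# Hypothetical "anonymized" release: names removed, quasi-identifiers kept.
released = [
    {"zip": "02138", "birth_year": 1990, "gender": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1973, "gender": "M", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1973, "gender": "M", "diagnosis": "flu"},
]

# Hypothetical public source that carries names next to the same columns.
public = [
    {"name": "Alice Example", "zip": "02138", "birth_year": 1990, "gender": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1973, "gender": "M"},
    {"name": "Carol Example", "zip": "02141", "birth_year": 1985, "gender": "F"},
]

KEYS = ("zip", "birth_year", "gender")


def link(released_rows, public_rows, keys):
    """Map each name to a diagnosis when exactly one released row matches it."""
    index = {}
    for row in released_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    matches = {}
    for person in public_rows:
        candidates = index.get(tuple(person[k] for k in keys), [])
        if len(candidates) == 1:  # an unambiguous match means re-identification
            matches[person["name"]] = candidates[0]["diagnosis"]
    return matches


reidentified = link(released, public, KEYS)
print(reidentified)  # {'Alice Example': 'asthma'}
```

Alice is re-identified because her combination is unique in the release. Bob is not, but only because two released rows share his combination, and Carol is simply not in the release.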

Another type of harm occurs as a result of combining small fragments of seemingly harmless information. If such information is combined, it can tell much (much) more. By combining pieces of information that we may not care much about protecting, the government and companies may collect information that we would like to hide. For example, suppose you bought a book about cancer. This purchase alone will say little about anything, as it simply indicates your interest in the disease. Let's say you bought a wig. You can buy a wig for many reasons. But if you combine these two pieces of information into one, one can assume that you have cancer and are undergoing chemotherapy. You might not want to tell others this information, but you definitely want to have a choice.

Have you ever questioned the nature of your reality?

Small businesses competing with big data companies are like people playing chess against a computer and hoping for luck. They play and think they are doing really well, but the computer has already seen millions of games, and it has a probabilistic model that tells it how likely it is to win after each particular move. Those small businesses have already lost; they just don't know it yet.

Data is the most valuable thing on the Internet. It's the new gold, the new oil. If you work with data properly, you have the advantage. The market is literally killing competition. You want to compete with Amazon? An innovative startup may be 100 times better and cooler than Amazon, but in practice it doesn't matter. A company with a lot of data and a lot of money will take a look at this startup, copy the most interesting things, and, most importantly, apply the experience of X billion people to it. It already knows how consumers will behave in different situations. Those who do not have this knowledge will eventually die.

Everybody must have read Orwell's 1984, with its posters saying "Big Brother is watching you". Imagine you finally open the door to Big Brother's office. What would you see there?

Actually, it is artificial intelligence.

Have you ever thought about the origin of your desires? How do Facebook, Instagram, advertising billboards, and even your favorite TV show influence them? UX has become so good that it slips into our heads unnoticed, along with everything else companies want to put there, because they know us better than we know ourselves.

When ML algorithms and techniques come into play, the problem reaches a completely different level. Unlike traditional query/triangulation approaches, ML can combine a huge number of input parameters/features in arbitrarily complex ways and can therefore violate privacy in ways that are still unknown and mysterious.

And if a government or commercial company integrates analytical software into a product or service that you use and you don't like it, you can't just quit. No one will ask you if you agree to be part of this research or not. Moreover, you are unlikely to be told at all that you are part of such research.

Don't get me wrong — I'm not saying that all these shortcomings should make us give up advanced analytical algorithms that often make our lives easier. Data management as an industry is now at the very beginning of its journey — it is definitely not going anywhere and will stay with us for a long time. However, now is the time to think about all these problems before it's too late.

We need secure algorithms with transparent data processing mechanisms and self-explanatory decision-making. Independent researchers need to be given access to the source code, and governments need to create appropriate legislation. It is also a good idea to tell people what is behind one algorithm or another. I believe that the most important and obvious problem in the data management field as a whole is education: education in terms of knowledge, and education in terms of people's responsibility and understanding when working with other people's personal data.

The next related problem is data reuse: processing information for purposes other than the ones it was collected for, and for which the person never gave consent. How long will personal information be kept? How will it be used? How will it be used in the future? The potential use of any part of personal information is limitless.

For example, mobile applications very often gather data while running to estimate, say, a person's income, and in general that data has a lot to say about a person. Look at the permissions that the Facebook app requires: do you think it really needs that much to show your news feed? Another example is applications that show the same person a different price tag for the same thing on different platforms. They detect whether the platform is Android or iOS, estimate the price a person can pay, cluster people by country, check what else is installed on the phone (Tinder, Pinterest, Discord, etc.), and then offer you a product from the narrow niche you ended up in. And of course they send all this data on to whoever will pay the most.

The next problem related to using personal data is data misrepresentation. While personal information can tell a lot about a person's personality and activities, it often fails to characterize the person as a whole. Raw data may also contain errors or lack information that is critical to making the right decision, so it may present a distorted picture of what is happening. For example, imagine a police officer entering a high-crime area. The algorithm warns him that the person in front of him is a murderer with 51% probability. The man has a suspicious package in his hands. But did the program take this into account in its analysis? Does the package make the man more suspicious or not? Is 51 percent high or low? These are all questions without answers.

Conclusion

Without privacy, there was no point in being an individual.
— Jonathan Franzen

There is a euphoria of coolness around the data management area, and not many people focus on the problems it brings, the main one being education.

One of the biggest advantages of big data, unbiasedness, does not really work. A decision based on calculations by algorithms that people created, using data that people selected, in the end still remains a human decision, and hence a biased one. There will always be mistakes, because ML models are, by their very nature, simplifications.

I don't want privacy in big data to end with each of us having public characteristics like a baseball player's stats, a social score that transparently shows how good you are. Still, fair models like that would at least be transparent, and that is better than having nothing at all.

I hope we haven't lost our privacy yet, but "data-driven" strategies from all sorts of companies are bringing that moment closer and closer. There is also a flip side: I'm sure that new tools for anonymizing and protecting user data will soon appear alongside AWS Macie, Apache Ranger, and Apache Atlas. On the ML side, there is more and more work on preserving user privacy while maintaining product quality; you can google terms like k-anonymity, l-diversity, t-closeness, and federated learning. So let's wait for the next data management hype.
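For anyone who would rather see than google, here is a minimal sketch of the first two terms. It coarsens the quasi-identifiers until every group contains at least k rows (k-anonymity), then checks how many distinct sensitive values each group holds (the simplest, "distinct" form of l-diversity). The rows, the generalization rules, and the function names are assumptions made up for this example, not taken from any particular tool.

```python
from collections import defaultdict

# Hypothetical rows: two quasi-identifiers and one sensitive attribute.
rows = [
    {"zip": "02138", "age": 34, "diagnosis": "flu"},
    {"zip": "02139", "age": 37, "diagnosis": "asthma"},
    {"zip": "02141", "age": 52, "diagnosis": "cancer"},
    {"zip": "02142", "age": 58, "diagnosis": "cancer"},
]


def generalize(row):
    """Coarsen quasi-identifiers: 3-digit ZIP prefix and a 10-year age band."""
    low = row["age"] // 10 * 10
    return {
        "zip": row["zip"][:3] + "**",
        "age": f"{low}-{low + 9}",
        "diagnosis": row["diagnosis"],
    }


def sensitive_by_group(table, keys=("zip", "age")):
    """Collect the sensitive values of all rows sharing the same quasi-identifiers."""
    groups = defaultdict(list)
    for row in table:
        groups[tuple(row[k] for k in keys)].append(row["diagnosis"])
    return groups


def k_anonymity(table):
    """Smallest group size."""
    return min(len(values) for values in sensitive_by_group(table).values())


def l_diversity(table):
    """Smallest number of distinct sensitive values within any group."""
    return min(len(set(values)) for values in sensitive_by_group(table).values())


generalized = [generalize(row) for row in rows]

print(k_anonymity(rows))         # 1: every raw row is unique
print(k_anonymity(generalized))  # 2: each group now hides two people
print(l_diversity(generalized))  # 1: one group still shares a single diagnosis
```

The last number is the reason l-diversity exists: after generalization nobody is unique, yet everyone in the 50-59 group has the same diagnosis, so knowing which group a person falls into is enough to learn it.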

All I have written are meaningless words and I'm absolutely not the first one who talks about it — we have to take steps toward education, privacy, and data security. It's obvious.

