When the data of more than 500 million Facebook users surfaced earlier this year, the company was quick to announce that this was not a security breach. Indeed, it wasn’t: all the data was “scraped” from the website without the permission of the social media giant. Still, Facebook may not get off as easily as it hopes, since the company could have prevented this scraping. Either way, the incident should give all of us something to think about.
Attackers no longer need to illegally access a company’s secured systems to find out about someone. We are very generous when it comes to sharing our data, and even privacy regulations like the GDPR or CCPA cannot protect it if we give it away voluntarily. That makes it easy to target individuals, or to use the data to launch a phishing attack on a company or organization. All an attacker needs is a little open-source intelligence (OSINT) – and the tools for web scraping.
While most will agree that unregulated data harvesting can be used for malicious purposes, the benefits of ethical data scraping cannot be overlooked. Digital businesses use web scraping tools to monitor user habits and purchase history to create a personalized experience for users. Search engines use it to deliver relevant search results. It has also driven huge advancements in machine learning and artificial intelligence.
Unfortunately, there are no uniform, internationally accepted laws and regulations that deal with web scraping. While the discussion about regulation continues, the industry itself focuses on self-regulation and the education of internet users. We spoke with Karolis Toleikis, CEO of IPRoyal, about the basics of web scraping.
Cyber Protection Magazine: What is web scraping?
Karolis Toleikis: Web scraping, also known as web harvesting or crawling, is the extraction of data from websites and its conversion into a structured format (like a spreadsheet) for the user. While manual web scraping is possible, automated tools are preferred because they save time.
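To illustrate the extraction Toleikis describes, here is a minimal sketch of an automated scraper using only Python’s standard library. The HTML snippet, the `person` class name, and the names in it are all hypothetical stand-ins for a page a real scraper would first download (e.g. with `urllib.request`); the point is turning unstructured markup into a structured list.

```python
from html.parser import HTMLParser

# Hypothetical page snippet standing in for fetched HTML; a real scraper
# would download this from a website first.
PAGE = """
<ul>
  <li class="person">Alice Example</li>
  <li class="person">Bob Example</li>
</ul>
"""

class PersonScraper(HTMLParser):
    """Collects the text of every <li class="person"> element."""
    def __init__(self):
        super().__init__()
        self.in_person = False
        self.people = []          # the structured output

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "person") in attrs:
            self.in_person = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_person = False

    def handle_data(self, data):
        if self.in_person and data.strip():
            self.people.append(data.strip())

scraper = PersonScraper()
scraper.feed(PAGE)
print(scraper.people)  # a list ready to be written to a spreadsheet or CSV
```

Run against the snippet above, this yields `['Alice Example', 'Bob Example']` – the same idea, scaled across millions of profile pages, is how a dataset like the Facebook one is assembled.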
Cyber Protection Magazine: Isn’t web scraping against the idea of privacy? After all, I only gave my data to a particular provider, not those who are scraping it.
Karolis Toleikis: When a user publishes something publicly, it is fair game. The automated tools just make collecting it easier.
Cyber Protection Magazine: How can attackers utilize the data they scraped (e.g., phishing)?
Karolis Toleikis: Email addresses, birthdays, home addresses, phone numbers, lists of friends – any publicly available information. The gathering itself is not illegal. With that data, attackers can target individuals or businesses with scam emails designed to get them to reveal more personal information. The crime lies in those follow-up activities and in how the gathered data is used.
Cyber Protection Magazine: How can individuals protect against scraping?
Karolis Toleikis: There are several things you can do to keep your data safe. For one thing, don’t post your full birthday on Facebook. It’s fun to get birthday wishes, but your birthday is used to verify your identity. The best you can do as an individual is to make sure you don’t share any private data with websites that don’t guarantee it will stay private. Also, don’t post any personal information in publicly available places. That’s the only way to make sure your data stays out of malicious hands. (Editor’s note: yesterday a venture capitalist asked us why he was getting notifications from a social media site about the problems he was having accessing his account. He wasn’t having any – someone was trying to guess his password to access his social media contacts and impersonate him. We recommended activating multi-factor authentication immediately.)
Cyber Protection Magazine: How can companies protect their customer or company data from scraping?
Karolis Toleikis: If a company requires a login to access any sensitive data, scraping becomes nearly impossible. Using CAPTCHAs separates human visitors from bots, so it’s another effective way to prevent data gathering. Finally, regularly updating the website’s HTML structure and keeping a close eye on user accounts with abnormal activity patterns also make things harder for anyone trying to harvest data.