Ethics and Machine Learning: The Case of Web Scraping
Scraping data from the internet has become a common practice. There is now a wide range of businesses that have built their entire operations on data that they scrape from other sources. But while web scraping may be commonplace, there are still some persistent ethical issues surrounding the practice that the scraping community needs to address.
Machine learning as a technology has long been a source of heated ethical debates, and web scraping is no different. Some people argue that it is flat-out impossible to remove human biases from machine learning algorithms. Even simple scraping tools will betray something of their creator’s own biases and prejudices.
Ethics Vs. The Law
The discussion about ethics in web scraping is different from discussing whether scraping other people’s data is legal. The exact legal status of web scraping remains challenging to decipher and may change according to where in the world you are. There have already been several court cases around the world that have ruled on whether there should be limits to the kind of data that businesses can scrape and where they can scrape from.
The legal question is currently unanswered, and it is taking a long time for any kind of definitive resolution to reveal itself. However, we don’t need any legal rulings to make a judgment about whether it is ethical to scrape data or not.
Most people agree that there are at least some circumstances where we should allow or even encourage web scraping. With web scraping, small startups can take advantage of data repositories that are built and maintained by much bigger and better-funded businesses. However, other people would argue that this constitutes theft as the data in question belongs to the business that gathers and stores it.
Why Ethical Data Use Matters
In the wake of the Cambridge Analytica scandal, and numerous other incidents in which Facebook exposed millions of people’s private data to the world, most people are more cautious about their data. A lot of people have abandoned Facebook permanently because of its lax data protection policies. Others are far more careful about what they share online after seeing how easily even celebrities’ personal data can leak.
If we want to continue to be able to scrape the internet freely, it is in all of our interests to make sure that we are doing so in an ethical way. When we scrape data or use services that scrape data, we often use other people’s private data. Without a consistent set of ethical standards for businesses to follow, we won’t know whether data about us that scrapers fetch from the internet is put to use responsibly or not.
Use Public APIs
Not everybody wants to have their data scraped. There is an ongoing debate as to whether it should be legal to scrape any data that is freely available online. Naturally, content creators want to be able to access as much data as possible, preferably for free. But for the people who invest time and effort in gathering, organizing, and storing data, it can be a little bit galling to have people benefitting from it for free.
A simple way of ensuring that you only ever scrape data from people who are happy to share it is to use public APIs only. If someone has built a public API to let you interface with their databases, it is a safe bet that they are happy for you to use their data. With most APIs you don’t even need to scrape; they let you search for exactly the data you need.
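As a rough illustration of querying a public API instead of scraping pages, here is a minimal Python sketch using only the standard library. The endpoint `https://api.example.com/v1/items`, the query parameters, and the function names are all hypothetical; a real API will document its own URLs, parameters, and authentication rules.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint, used purely for illustration.
API_BASE = "https://api.example.com/v1/items"


def build_search_url(base_url, **params):
    """Return base_url with the given search parameters as a query string."""
    # Sorting keeps the URL stable regardless of argument order.
    query = urlencode(sorted(params.items()))
    return f"{base_url}?{query}"


def search(base_url, user_agent, **params):
    """Send one search request and return the decoded JSON response."""
    request = Request(
        build_search_url(base_url, **params),
        # Identify yourself honestly so the API owner can contact you.
        headers={"User-Agent": user_agent, "Accept": "application/json"},
    )
    with urlopen(request, timeout=10) as response:
        return json.load(response)
```

Calling `search(API_BASE, "my-project/0.1 (me@example.com)", q="web scraping", limit=10)` would then request only the records that match, rather than downloading whole pages and picking the data out of them.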
Request Rate Limiting
A simple thing that every conscientious scraper should do to minimize their scraping impact on their targets is to introduce limits to the number of requests they send. If you send lots of requests in a short space of time, it can cause serious problems for whoever owns the server you are requesting data from.
Requesting data too frequently can make your traffic indistinguishable from a DDoS attack. If a business or organization thinks that you’re trying to take them offline, they are likely to fight back. This could mean banning your IP address and preventing you from accessing their servers again.
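One simple way to limit your request rate is to enforce a minimum delay between consecutive requests. The Python sketch below is one possible approach, not a definitive implementation. The `RateLimiter` class and its parameters are hypothetical names, and the right interval depends on the site you are requesting data from.

```python
import time


class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock  # injectable so the class can be tested
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Pause if the last request was too recent; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay
```

To use it, create one limiter per target site, for example `limiter = RateLimiter(2.0)`, and call `limiter.wait()` immediately before each request. The first call returns at once, and later calls pause only as long as needed to keep requests at least two seconds apart.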
Think Before You Request
Before you start requesting any data from any source, you should first be clear about why you are requesting it. Requesting data for the sake of it is the kind of move that ends up killing a business’s productivity. It doesn’t matter whether you are gathering data for a team project or an individual passion project. In either case, you need to know what data you are collecting and why before you begin.
Don’t Misrepresent Data
If you are using data that you have scraped from other people in your professional work, you must acknowledge this appropriately. You shouldn’t try to pass the underlying data off as your own when you have gathered it from elsewhere. It is important always to give credit where credit is due. More importantly, it is unethical to do otherwise.
On a similar note, if there is an opportunity for you to return the favor and send some of the value that you generate with other people’s data back to them then you should do so. Returning value to the businesses you source key data from helps ensure that you maintain your access. It also balances out the resource cost of your scraping to some extent.
The legal status of web scraping remains murky. There are a few precedent-setting cases currently working their way through court systems around the world. Unfortunately, it seems as if different jurisdictions are going to take different approaches.
The only thing you can do to try to stay on the right side of the law is to behave ethically and not do anything that will antagonize the people you scrape your data from. As long as you treat your data sources with respect and you adhere to ethical practices, scraping can benefit you and the people you scrape data from.