What is Web Scraping?
Web scraping (or data extraction) is a technique used to collect content and data from the internet. This data is usually saved in a local file so that it can be manipulated and analyzed as needed.
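As a rough illustration of the idea, the sketch below extracts two fields from a small HTML snippet and saves them to a CSV file using only Python's standard library. The markup, the class names (`product`, `name`, `price`), and the helper names are assumptions made up for this example; a real scraper would first download the page over HTTP.

```python
import csv
from html.parser import HTMLParser

# Hypothetical page content, held in a string so the sketch runs offline.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Kettle</span><span class="price">24.99</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">39.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="price"> per <li>."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows, one dict per product
        self._field = None   # field currently being read, if any
        self._current = {}   # row being assembled

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            self.rows.append(self._current)
            self._current = {}


def scrape_products(html):
    """Return a list of {"name": ..., "price": ...} dicts found in the HTML."""
    parser = ProductParser()
    parser.feed(html)
    return parser.rows


def save_csv(rows, path):
    """Write the scraped rows to a local CSV file for later analysis."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price"])
        writer.writeheader()
        writer.writerows(rows)


rows = scrape_products(SAMPLE_HTML)
print(rows)
```

Calling `save_csv(rows, "products.csv")` would then store the result locally, which is the "saved in a local file" step described above.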
Web Scraping vs Web Crawling
According to ScrapeOps, web crawling is the process of traversing a website by following links in its pages or its sitemap, often indexing the pages as it goes. Google uses web crawling to index the web and to power a search engine that can find content almost anywhere.
Web scraping, on the other hand, is the process of extracting specific bits of data from a target web page and storing it for your own purposes.
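The difference can be sketched in a few lines of Python. The example below is a simplified illustration, not a production crawler: the three-page `SITE` dictionary, its paths, and the `price` class name are all hypothetical stand-ins for real pages fetched over HTTP. `crawl` follows links to discover pages, while `scrape_price` pulls one specific value out of a single page.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical three-page site held in memory so the sketch runs offline.
SITE = {
    "/": '<h1>Home</h1><a href="/a">A</a><a href="/b">B</a>',
    "/a": '<h1>Page A</h1><span class="price">10.00</span><a href="/b">B</a>',
    "/b": '<h1>Page B</h1><span class="price">12.50</span><a href="/">Home</a>',
}


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class PriceParser(HTMLParser):
    """Captures the text inside <span class="price">."""

    def __init__(self):
        super().__init__()
        self.price = None
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.price = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False


def crawl(start):
    """Crawling: visit every page reachable from `start` by following links."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        parser = LinkParser()
        parser.feed(SITE[url])
        for link in parser.links:
            if link in SITE and link not in seen:
                seen.add(link)
                queue.append(link)
    return order


def scrape_price(url):
    """Scraping: extract one specific piece of data from a target page."""
    parser = PriceParser()
    parser.feed(SITE[url])
    return parser.price


print(crawl("/"))
print(scrape_price("/a"))
```

In practice the two are often combined: a crawler discovers the pages, and a scraper runs on each page it finds.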
Uses for Web Scraping
- Price Intelligence – Price intelligence refers to monitoring a competitor’s prices and responding to their changes in pricing. Retailers use price intelligence to maintain a competitive edge over their rivals.
- Market Research – Web data extraction plays a vital role in market research. Market researchers use the resulting data to inform their market trend analysis, research and development, competitor analysis, price analysis, and other areas of study.
- Lead Generation – Businesses that want to attract new customers and generate more sales need to launch effective sales and marketing campaigns. Web scraping can help companies gather the correct contact information from their target market—including names, job titles, email addresses, and cellphone numbers. Then, they can reach out to these contacts and generate more leads and sales for their business.
- Business Automation – In some cases, you may need to extract large amounts of data from a group of websites, and to do so consistently, quickly, and in a structured way. You can use web scraping tools to extract these data sets automatically.
- Real Estate – Up-to-date, accurate real estate listings depend on web data extraction. Web scraping is commonly used to retrieve current data about properties, sale prices, monthly rental income, amenities, property agents, and other data points. Web-scraped data also informs property value appraisals, rental yield estimates, and real estate market trend analysis.
- News and Content Marketing – Businesses, political campaigns, and nonprofits that need to keep a close eye on brand sentiment, polls, and other trends often invest in web scraping tools. Content and digital marketing agencies also use web scraping tools to monitor, aggregate, and parse the most critical stories from different industries.
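To make the price intelligence use case above more concrete, here is a minimal sketch of the "monitoring" step: it compares two snapshots of scraped prices and reports which products changed. The function name, product names, and prices are invented for illustration, and the snapshots are assumed to have been scraped already.

```python
from decimal import Decimal


def price_changes(previous, current):
    """Return {product: (old_price, new_price)} for products whose price changed.

    Products that appear in only one snapshot are ignored in this sketch.
    """
    changes = {}
    for product, new_price in current.items():
        old_price = previous.get(product)
        if old_price is not None and old_price != new_price:
            changes[product] = (old_price, new_price)
    return changes


# Hypothetical snapshots of a competitor's prices from two scraping runs.
yesterday = {"Kettle": Decimal("24.99"), "Toaster": Decimal("39.50")}
today = {
    "Kettle": Decimal("22.99"),
    "Toaster": Decimal("39.50"),
    "Blender": Decimal("59.00"),
}

for product, (old, new) in price_changes(yesterday, today).items():
    print(f"{product}: {old} -> {new}")
```

`Decimal` is used instead of `float` so that prices compare exactly; how a retailer responds to a detected change is a business decision outside the scope of this sketch.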
Legality of Web Scraping
The legality of web scraping depends on the legal jurisdiction, i.e. laws are country- and locality-specific. Gathering or scraping publicly available information is generally not illegal in itself. If it were, Google could hardly exist as a company, because it collects data from a vast number of websites.
- United States Court of Appeals for the Ninth Circuit – (hiQ Labs v. LinkedIn (2022)) – The decision echoes the appeals court’s 2019 decision, which upheld a lower court’s 2017 determination in hiQ v. LinkedIn that web scraping doesn’t qualify as accessing a protected computer without authorization.
- The US Supreme Court – (Van Buren v. United States (2021)) – This case does not directly address web scraping, but it does bear on how the Computer Fraud and Abuse Act (CFAA), enacted in 1986, is applied in web scraping cases.
It is a complex case that may not answer every question about the CFAA, but it does seem to narrow the scope of the CFAA considerably, which should deter companies that rely on the CFAA to target web scraping.
- US District Court for the Eastern District of New York – (Genius Media Group Inc. v. Google LLC and LyricFind (2020)) – Google, arguably the world’s largest scraping company, had a web scraping case against it dismissed by Judge Margo Brodie. Genius alleged that Google had repeatedly scraped lyrics from Genius to show in its search results, and the judge dismissed the lawsuit.
- US District Court for the District of Columbia – (Sandvig v. Sessions (2018)) – This ruling addresses web scraping directly and states:
scraping plausibly falls within the ambit of the First Amendment.
and
That plaintiffs wish to scrape data from websites rather than manually record information does not change the analysis. Scraping is merely a technological advance that makes information collection easier; it is not meaningfully different from using a tape recorder instead of taking written notes, or using the panorama function on a smartphone instead of taking a series of photos from different positions.
- US District Court for the District of Columbia – (Sandvig v. Barr (2020)) – The US District Court in Washington, DC, ruled that violating a website’s terms of service isn’t a crime under the Computer Fraud and Abuse Act. US District Judge John D. Bates, in Sandvig v. Barr (Civil Action No. 16-1368), said:
Criminalizing terms-of-service violations risks turning each website into its own criminal jurisdiction and each webmaster into his own legislature. Such an arrangement, wherein each website’s terms of service “is a law unto itself”, would raise serious problems.
- US Court of Appeals for the Ninth Circuit – (Oracle vs Rimini Street (2018))
“Taking data using a method prohibited by the applicable terms of use” (i.e., scraping), “when the taking itself generally is permitted, does not violate” the state computer crime laws.
“As EFF puts it, ‘[n]either statute . . . applies to bare violations of a website’s terms of use—such as when a computer user has permission and authorization to access and use the computer or data at issue, but simply accesses or uses the information in a manner the website owner does not like.’”