Fields of Gold: Scraping Web Data for Marketing Insights

Johannes Boegershausen, Hannes Datta, Abhishek Borah and Andrew T. Stephen

Listen to the authors present their findings (source: June 2022 JM Webinar)

The recent ruling of the Ninth Circuit in hiQ Labs v. LinkedIn underscores the importance of navigating legal challenges when using web scraping to collect data for academic research. While it may be permissible to collect information from publicly available sites, researchers still need to be cautious about how they design their extraction software. For example, collecting information from publicly available user profiles may raise privacy concerns in some jurisdictions, prompting researchers to anonymize their data during collection rather than afterward.
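To make the idea of anonymizing during collection concrete, here is a minimal Python sketch, not the authors' method. It assumes each scraped record arrives as a dictionary. The field names `username` and `profile_url` and the helper names are hypothetical. Strictly speaking, a keyed hash pseudonymizes rather than fully anonymizes, so whether this suffices in a given jurisdiction is a legal question the code cannot answer.

```python
import hashlib
import hmac
import secrets

# Assumption: a secret key generated once per project and stored apart from the data.
SECRET_KEY = secrets.token_bytes(32)

# Hypothetical names of fields that directly identify a user.
IDENTIFYING_FIELDS = {"username", "profile_url"}


def pseudonymize(user_id: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest so the raw identifier is never stored."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_record(record: dict, key: bytes = SECRET_KEY) -> dict:
    """Drop identifying fields and add a stable pseudonymous key before saving."""
    cleaned = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    cleaned["user_key"] = pseudonymize(record["username"], key)
    return cleaned
```

Calling `anonymize_record` on each record before it is written to disk keeps the raw profile identifiers out of the stored dataset. Because the same user always maps to the same `user_key`, that user's observations can still be linked.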

While the legal aspects of collecting web data have received some attention in the recent past, they are not the only challenge researchers and managers face. Using scraping and application programming interfaces (APIs) to collect web data raises a wide range of validity concerns that, if unaddressed, may undermine the quality of the research. A new article in the Journal of Marketing proposes a methodological framework focused on enhancing the validity of web data while balancing legal and technical concerns.

While marketing researchers increasingly employ web data, the idiosyncratic and sometimes insidious challenges in its collection have received limited attention. How can researchers ensure that the datasets generated via web scraping and APIs are valid? Our research team developed a novel framework that highlights how addressing validity concerns requires the joint consideration of idiosyncratic technical and legal/ethical questions.

Our framework covers the broad spectrum of validity concerns that arise along the three stages of the automatic collection of web data for academic use: selecting data sources, designing the data collection, and extracting the data. In discussing the methodological framework, we offer a stylized marketing example for illustration. We also provide recommendations for addressing challenges researchers encounter during the collection of web data via web scraping and APIs.

Our article further provides a systematic review of more than 300 articles using web data published in the top five marketing journals. Using our review, we devise a typology of how web data has advanced marketing thought. Understanding the richness and versatility of web data is invaluable for scholars curious about integrating it into their research programs.

Interested researchers can access the database developed for this review on our companion website at https://web-scraping.org/. This website also features additional useful resources and tutorials for collecting web data via web scraping and APIs.

Finally, we use our methodological framework and typology to unearth new and underexploited “fields of gold” associated with web data. We seek to demystify the use of web scraping and APIs and thereby facilitate broader adoption of web data across the marketing discipline. Our Future Research section highlights novel and creative avenues of using web data that include exploring underutilized sources, creating rich multi-source datasets, and fully exploiting the potential of APIs beyond data extraction.

From: Johannes Boegershausen, Hannes Datta, Abhishek Borah, and Andrew T. Stephen, “Fields of Gold: Scraping Web Data for Marketing Insights,” Journal of Marketing.

Johannes Boegershausen is Assistant Professor of Marketing, Erasmus University, The Netherlands.

Hannes Datta is Associate Professor of Marketing, Tilburg University, The Netherlands.

Abhishek Borah is Assistant Professor of Marketing, University of Washington.

Andrew T. Stephen is Associate Dean of Research and L’Oreal Professor of Marketing, University of Oxford, UK.