A Framework for data mining of structured semantic markup extracted from educational resources on University websites
Authors: Lorena Recalde, Rosa Navarrete, Luis Rosero Correa
Abstract: The coronavirus pandemic has forced education at all levels to shift from face-to-face to online learning. In response, universities are releasing a significant number of educational resources on the Web to support virtual education. End users who need these educational resources explore the Web through search engines such as Google, Yahoo, Yandex, or Bing; unfortunately, the search results they obtain lack accuracy and do not necessarily match their requirements. This problem arises because Web resources are released without regard for their visibility or ease of discovery. One way to improve the experience of users who browse the Web is to deliver more appropriate content in response to their searches. A way to make web search results more meaningful is to embed structured semantic markup in the HTML of web pages using standards such as JSON-LD and the Schema.org vocabulary, in compliance with W3C recommendations. Search engines can interpret this markup to understand the resources being published and, consequently, improve the relevance of search results. For example, Google uses structured semantic markup to show rich results, Rich Snippets, or even Knowledge Graph information in user searches. This research proposes a framework that enables a systematic analysis of the websites of top-ranking universities, focusing on the educational content they provide, in order to review the embedded semantic markup annotated with JSON-LD and the Schema.org vocabulary. To this end, a worldwide list of the universities that appear in the top international rankings has been compiled. Then, using web scraping techniques, we analyzed these universities' websites in search of educational resources and checked whether embedded structured markup is included. Finally, data mining techniques were used to describe and organize the educational resources obtained. The contribution of this work is two-fold.
First, it presents an analysis of the embedded structured markup that uses the Schema.org vocabulary and the JSON-LD format in university websites. This analysis is relevant because previous research has not explicitly focused on the educational field or has not used a specific dataset within this context. Second, it proposes a framework that supports this type of analysis of embedded structured markup, from the data collection phase through to obtaining results and indicators on the data; that is, it covers the data mining process from download to final data analysis. The proposed framework consists of eleven components distributed across three well-defined layers: the data access layer, the service layer, and the application layer. The development process for the framework components merges two methodologies: Design Science Research (DSR), to guide the creation of an artifact, and CRISP-DM, to address the data mining process. The architecture of the framework integrates tools such as Scrapy (Python) for web scraping and crawling; MongoDB for handling semi-structured data in a NoSQL data store; Redis as an auxiliary in-memory database that is queried to determine whether the URLs extracted during web scraping have already been processed (duplicate control); and Apache Kafka as a communication intermediary that facilitates the exchange of information between the other components. Moreover, this work provides a dataset made up of the HTML pages of the universities' websites that can be used for further analysis.
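To make the markup review step concrete, the following is a minimal sketch of how JSON-LD blocks could be pulled out of a downloaded HTML page and their top-level Schema.org types listed. It is not the framework's implementation: it uses only the Python standard library rather than Scrapy, and the names `JsonLdExtractor`, `extract_schema_types`, and `SAMPLE_HTML`, as well as the sample course page, are hypothetical and introduced here for illustration only.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects JSON-LD objects found in <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            type_attr = dict(attrs).get("type") or ""
            if type_attr.strip().lower() == "application/ld+json":
                self._in_jsonld = True
                self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # skip malformed markup instead of failing the whole page
            # A block may contain a single object or a list of objects.
            self.items.extend(parsed if isinstance(parsed, list) else [parsed])


def extract_schema_types(html):
    """Return the top-level @type values of every JSON-LD object in the page."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    types = []
    for item in parser.items:
        if isinstance(item, dict) and "@type" in item:
            value = item["@type"]
            types.extend(value if isinstance(value, list) else [value])
    return types


# Hypothetical page fragment, used only to exercise the sketch.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Course",
 "name": "Introduction to Data Mining",
 "provider": {"@type": "CollegeOrUniversity", "name": "Example University"}}
</script>
</head><body><p>Course page</p></body></html>
"""

if __name__ == "__main__":
    print(extract_schema_types(SAMPLE_HTML))
```

For the sample page, the function returns the single top-level type `Course`; nested types such as the provider's `CollegeOrUniversity` are deliberately not traversed in this sketch. A page without JSON-LD yields an empty list, which is one simple way to flag resources that lack embedded structured markup.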
Keywords: structured semantic markup, data mining, user experience, framework, educational resources