Evaluating embedded semantics for accessibility description of web crawl data

Open Access
Conference Proceedings
Authors: Rosa Navarrete, Diana Martinez-Mosquera, Lorena Recalde, Marco Aguirre

Abstract: The Web keeps expanding, driven further by the demand for content consumption that arose from the pandemic. This growth highlights the need for equitable access to web content for all people, regardless of disability. To this end, it is essential to focus on web accessibility. The World Wide Web Consortium (W3C), the leading organization responsible for ensuring the growth of the social value of the Web, establishes standards, protocols, and recommendations to extend the reach of web content. For instance, the Web Content Accessibility Guidelines (WCAG) promote web accessibility. Furthermore, other W3C recommendations foster embedding semantics in web content so that browsers can build a machine-readable data structure. This structure is intended to produce enriched descriptions in search results, helping people find the right content for their queries and, consequently, improving user experience. Searching for specific web content is especially demanding for people with disabilities, because they may be forced to explore many search results before finding content that matches their accessibility requirements. If embedded semantics communicate the accessibility properties of the content, search becomes more productive for everyone, and even more so for people with special needs. Embedded semantics require two components: a vocabulary and an encoding format. The Schema.org vocabulary has grown rapidly and offers many descriptors for each type of web information, including a set of descriptors for accessibility conditions. Regarding the format, JSON-LD is the latest W3C recommendation for encoding, owing to its ability to make JSON data interoperate at Web scale. It provides a quick transformation to a Linked Data format and is simple enough to be read and written by people.
This research conducts a quantitative analysis of the semantics embedded in web content by processing a dataset obtained from millions of web crawl records for 2021. The data come from diverse sources and serve diverse purposes at a global scale. In this web content, each annotation is made through a JSON-LD script of embedded semantics using the Schema.org vocabulary. The analysis determines how the accessibility descriptors are used in conjunction with other classes and properties to describe web information on personal blogs, organizations, events, educational content, universities, persons, commerce, sports, medicine, entertainment, and more. The results provide a perspective on the awareness of accessibility across the different purposes of the Web.
The processing was performed on collected zip files that contain over three hundred million records. The analysis used massive data analysis techniques such as key-value modeling with Python for processing, with a NoSQL database (MongoDB) for storage. A new dataset with normalized data was generated with information about domains, types of web content, and properties associated with the accessibility descriptor. The collection and storage layers were implemented on a computing platform with 30 GB of RAM, 10 CPUs, and 2 TB of storage.
This research delivers two main contributions. The first is an analysis of the interest across the Web in using accessibility descriptors in embedded semantics. The quantitative results make it possible to assess the concern for equity and inclusion, made visible through accessibility issues, in different entities according to their web domains. Moreover, these results reveal how the W3C recommendation for embedded semantics is being adopted to create a more organized and better-documented Web. The second is a new normalized dataset in JSON format, produced by processing the raw dataset, with information about domains, web content types, and properties associated with the accessibility descriptor. This new dataset will be available for further analysis of embedded semantics.
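
As an illustration only, the key-value normalization described in the abstract might look like the following Python sketch. It is not the authors' pipeline. The input layout (a page URL plus the text of one JSON-LD script), the helper names `as_list`, `normalize`, and `count_by_type`, the output field names, and the sample annotation are all assumptions made for this example. The property names in `ACCESSIBILITY_PROPS` are the accessibility properties defined by Schema.org, taken here as the assumed scope of "accessibility descriptors".

```python
import json
from collections import Counter
from urllib.parse import urlparse

# Schema.org accessibility properties (assumed scope of "accessibility descriptors").
ACCESSIBILITY_PROPS = {
    "accessMode",
    "accessModeSufficient",
    "accessibilityAPI",
    "accessibilityControl",
    "accessibilityFeature",
    "accessibilityHazard",
    "accessibilitySummary",
}


def as_list(value):
    """Wrap a scalar in a list so single and multiple values are handled alike."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def normalize(page_url, jsonld_text):
    """Return one flat record per top-level JSON-LD node that uses an accessibility property."""
    data = json.loads(jsonld_text)
    # A JSON-LD script may hold a single node, a list of nodes, or an "@graph" wrapper.
    if isinstance(data, dict) and "@graph" in data:
        nodes = as_list(data["@graph"])
    else:
        nodes = as_list(data)
    domain = urlparse(page_url).netloc.lower()
    records = []
    for node in nodes:
        if not isinstance(node, dict):
            continue
        used = sorted(node.keys() & ACCESSIBILITY_PROPS)
        if not used:
            continue  # keep only nodes that carry accessibility information
        records.append({
            "domain": domain,
            "types": [str(t) for t in as_list(node.get("@type"))],
            "accessibility": {p: as_list(node[p]) for p in used},
            # Properties used alongside the accessibility descriptors.
            "other_properties": sorted(
                k for k in node
                if k not in ACCESSIBILITY_PROPS and not k.startswith("@")
            ),
        })
    return records


def count_by_type(records):
    """Count how often each (type, accessibility property) pair occurs."""
    counts = Counter()
    for rec in records:
        for t in rec["types"]:
            for p in rec["accessibility"]:
                counts[(t, p)] += 1
    return counts


# Hypothetical annotation, as it might appear in a JSON-LD script on a page.
SAMPLE = json.dumps({
    "@context": "https://schema.org",
    "@type": "Book",
    "name": "Example title",
    "accessMode": ["textual", "visual"],
    "accessibilityFeature": ["alternativeText", "tableOfContents"],
    "accessibilityHazard": "none",
})

records = normalize("https://books.example.org/item/1", SAMPLE)
counts = count_by_type(records)
```

The sketch inspects only top-level nodes and leaves out storage. In a setup like the one described, the normalized records could then be written to a MongoDB collection, for example with pymongo's `insert_many`.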

Keywords: embedded semantic, Schema vocabulary, JSON-LD, accessibility properties, massive data analysis, user experience

DOI: 10.54941/ahfe1003774
