A Data Driven Approach for Language Acquisition

Open Access
Conference Proceedings
Authors: Ting Liu, Sharon Small, James Kubricht, Peter Tu, Harry Shen, Lydia Cartwright, Samuil Orlioglu

Abstract: Automatic Language Acquisition focuses on teaching an agent to acquire the knowledge it needs to understand its surrounding environment and to adapt to new environments. Traditional language understanding models fall into three main categories: supervised, semi-supervised, and unsupervised. A supervised approach is usually accurate but requires a large training dataset, which is expensive and time-consuming to build. In addition, the trained model is difficult to transfer to other domains. An unsupervised model, on the other hand, is cheap and flexible to build, but its performance is usually significantly lower than that of a supervised one. Starting from a relatively small set of guidance, a semi-supervised approach can teach itself from the unlabeled dataset and achieve performance comparable to that of a supervised model. However, building the guidance is not a trivial task, since the effectiveness of the learning process depends on the relationship between the labeled and unlabeled data. Unlike the traditional models, children do not require large amounts of training data when they learn. Instead, they can accurately generalize their knowledge from one object to other objects. In addition, communication with their parents, teachers, and peers helps to correct wrong claims that arise from the generalization. In this paper, we present a multimodal system that simulates the children's learning process to acquire knowledge of entities by studying three types of attributes: descriptive (the appearance of an entity), defining (the components of an entity), and affordance (how an entity can be used). We first use an unsupervised Emergent Language (EL) approach to generate a symbolic language (EL codes) that interprets the given images of the entities (10 images per entity). K-means clustering is then used to group entity images that share similar EL codes.
Then we employ a data-driven approach to teach the agent the attributes of the entities in the clusters. We first calculate the tf-idf scores of the words in the text pieces containing an entity, which are extracted from the Corpus of Contemporary American English, a balanced corpus of American English. Given the top-ranked words and a few sample text pieces, a human expert identifies which words are attributes and the type of each attribute. The human expert also marks the text pieces in which the words act as attributes. For example, red is a descriptive attribute of cup in “the red cup” but not in “a cup of red wine”. The learned knowledge is sent to a bootstrapping model to find not only new attributes but also new entities. Our results show that the data-driven approach took much less time yet learned more attributes than the baseline of our system, which teaches the agent the defining attributes of the entities using a carefully designed curriculum.
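To make the tf-idf ranking step concrete, the following is a minimal sketch and not the authors' implementation. It assumes that all text pieces for one entity are pooled into a single document, that tokens are split on whitespace, and that the score is length-normalized term frequency times log(N/df); the abstract does not specify any of these choices. The function name `tfidf_rank` and the sample text pieces are hypothetical and are not drawn from the corpus.

```python
import math
from collections import Counter


def tfidf_rank(pieces_by_entity, entity, top_k=5):
    """Rank the words that co-occur with `entity` by tf-idf.

    Assumption: each entity's text pieces are pooled into one document,
    and idf is computed across the entity documents.
    """
    # Build one bag of words per entity (lowercased, whitespace tokens).
    bags = {
        name: Counter(w for piece in pieces for w in piece.lower().split())
        for name, pieces in pieces_by_entity.items()
    }
    n_docs = len(bags)
    target = bags[entity]
    total = sum(target.values())

    scores = {}
    for word, count in target.items():
        if word == entity:
            continue  # the entity itself is not a candidate attribute
        df = sum(1 for bag in bags.values() if word in bag)
        scores[word] = (count / total) * math.log(n_docs / df)

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical text pieces, invented for illustration only.
sample_pieces = {
    "cup": ["the red cup", "a red cup with a handle", "drink from the cup"],
    "chair": ["the wooden chair", "sit on the chair", "a chair with four legs"],
}

ranked = tfidf_rank(sample_pieces, "cup", top_k=10)
for word, score in ranked:
    print(f"{word}\t{score:.3f}")
```

In this toy data, words shared by both entities (such as "the") receive a score of zero, while "red" ranks first for cup. A ranking like this only proposes candidates: deciding whether a word is an attribute, and of which type, remains the human expert's task as described above.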

Keywords: language acquisition, data driven, bootstrapping, human-machine interface, clustering

DOI: 10.54941/ahfe1002842
