Word processing entails retrieval of a unitary yet multidimensional semantic representation (e.g., a lemon's colour, flavour, possible use) and has been investigated in both cognitive neuroscience and artificial intelligence. To enable the direct comparison of human and artificial semantic representations, and to support the use of natural language processing (NLP) for computational modelling of human understanding, a critical challenge is the development of benchmarks of appropriate size and complexity. Here we present a dataset probing semantic knowledge with a three-terms semantic associative task: which of two target words is more closely associated with a given anchor (e.g., is lemon closer to squeezer or sour?). The dataset includes both abstract and concrete nouns, for a total of 10,107 triplets. For the 2,255 triplets with varying levels of agreement among NLP word embeddings, we additionally collected behavioural similarity judgments from 1,322 human raters. We hope that this openly available, large-scale dataset will be a useful benchmark for both computational and neuroscientific investigations of semantic knowledge.

A key aspect of human intelligence is the ability to store and retrieve knowledge on objects, facts, and people via symbols: reading the word lemon activates a multidimensional yet unitary concept which includes its physical attributes (e.g., a lemon is yellow and roundish) but also its relations to other concepts (e.g., you can use a squeezer to get juice out of a lemon) 1. Cognitive neuroscience investigations of the behavioural correlates and neural substrates of semantic representations have focused on probing biological agents with carefully designed semantic paradigms and thoroughly selected stimuli, often inferring representational content and structure from semantic judgments on pairs of words 2, 3. Similarly, in natural language processing (NLP), models are often compared against curated benchmarks using behavioural data as ground truth 4. However, while NLP models progressively approximate human-like language performance, it is increasingly challenging to evaluate the nature of their internal representations and how closely they align with those supporting human understanding.

Virtually all currently used benchmarks, i.e., a task and its related dataset of stimuli and responses, suffer from one or more of the following limitations (Table 1). First, they are rather limited in size, typically offering no more than a thousand stimuli. For instance, WordSim-353, a dataset including pairs of words linked by either semantic similarity (cup-mug) or semantic relatedness (cup-coffee) 5, contains only 353 word pairs 6. SimLex-999, a dataset specifically targeting semantic similarity, includes a total of 999 pairs 7. The size of the stimuli dataset is critical to enable future applications in settings with data-hungry models 8.

Table 1. Fifteen benchmarks available in the field of Natural Language Processing to investigate semantic representations with similarity-based tasks. Benchmarks are sorted by size (i.e., number of word pairs available).

Data and code are released under the Creative Commons Attribution 4.0 International Public Licence (CC-BY 4.0) 44. The OSF repository also includes the datasheet for the dataset 54.
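The three-terms task described above can be sketched in a few lines: an embedding "answers" a triplet by returning whichever target is closer to the anchor, here measured with cosine similarity. This is a minimal illustration, not the dataset's own code; the function name and the toy 2-D vectors are invented for the example.

```python
import numpy as np

def solve_triplet(anchor, target_a, target_b, embedding):
    """Return the target whose vector is closer (cosine) to the anchor.

    `embedding` maps words to vectors. Names and vectors here are
    illustrative assumptions, not the actual dataset embeddings.
    """
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    sim_a = cos(embedding[anchor], embedding[target_a])
    sim_b = cos(embedding[anchor], embedding[target_b])
    return target_a if sim_a >= sim_b else target_b

# Toy vectors, hand-crafted so that "sour" lies closer to "lemon".
toy = {
    "lemon": np.array([1.0, 0.2]),
    "sour": np.array([0.9, 0.3]),
    "squeezer": np.array([0.1, 1.0]),
}
answer = solve_triplet("lemon", "squeezer", "sour", toy)
print(answer)  # -> sour
```

Human raters answer the same forced-choice question, which is what makes the task a common currency for comparing biological and artificial representations.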
The code used to generate the triplets and compare the embeddings is made available at: Borghesani V, Armoza J, Hebart MN, Brambati SM, Bellec P, 2023. The code to generate triplets requires, at a minimum, three inputs: the list of words to be used as anchors, the words' concreteness ratings, and the pre-trained embedding used to define word distances. This code can be used to generate novel triplets fitting other experimental goals, for instance triplets at fixed distances between target words or triplets with only abstract (or concrete) terms. The code to compare embeddings requires the generated triplets and the embeddings one wishes to use to solve the triplet task. It can easily be adapted to test novel embeddings (e.g., with different training samples or vocabulary sizes). We also provide a notebook to perform basic exploration of the dataset, including all the analyses reported here: ColabNotebook_3TT (available on the OSF repository as well). All analyses can be reproduced with the data directly available on the OSF repository, save for the two requiring individual subject data: Results_Demographics_1322.csv and Results_Responses_1322.csv will be accessible only after proper registration with the CNeuromod databank ( ) due to ethical considerations.
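The embedding-comparison step amounts to scoring how often two embeddings pick the same target across a set of triplets. The sketch below is a hypothetical stand-in for that logic, assuming triplets as `(anchor, target_a, target_b)` tuples and embeddings as plain word-to-vector dictionaries; it is not the released code itself.

```python
import numpy as np

def embedding_agreement(triplets, emb_1, emb_2):
    """Fraction of triplets on which two embeddings choose the same target.

    Hypothetical helper mirroring the comparison step described in the
    text: each triplet is (anchor, target_a, target_b); each embedding
    is a dict mapping word -> vector.
    """
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def choice(emb, anchor, a, b):
        return a if cos(emb[anchor], emb[a]) >= cos(emb[anchor], emb[b]) else b

    same = sum(choice(emb_1, *t) == choice(emb_2, *t) for t in triplets)
    return same / len(triplets)

# Two toy embeddings that disagree on a single triplet: in emb_1 "sour"
# is aligned with "lemon", in emb_2 "squeezer" is.
emb_1 = {"lemon": np.array([1.0, 0.0]),
         "sour": np.array([1.0, 0.1]),
         "squeezer": np.array([0.0, 1.0])}
emb_2 = {"lemon": np.array([1.0, 0.0]),
         "sour": np.array([0.0, 1.0]),
         "squeezer": np.array([1.0, 0.1])}
triplets = [("lemon", "sour", "squeezer")]
disagree = embedding_agreement(triplets, emb_1, emb_2)
self_agree = embedding_agreement(triplets, emb_1, emb_1)
```

Triplets with low agreement across embeddings are exactly the ones for which the dataset's additional human judgments are most informative, since the models themselves offer no consensus answer.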