
arXiv:2502.14791v1 [cs.CL]

Rapid Word Learning Through Meta In-Context Learning

Authors: Wentao Wang, Guangyuan Jiang, Tal Linzen, Brenden M. Lake

Published: 2025-02-20

Abstract:

Humans can quickly learn a new word from a few illustrative examples, and then systematically and flexibly use it in novel contexts. Yet the abilities of current language models for few-shot word learning, and methods for improving these abilities, are underexplored. In this study, we introduce a novel method, Meta-training for IN-context learNing Of Words (Minnow). This method trains language models to generate new examples of a word's usage given a few in-context examples, using a special placeholder token to represent the new word. This training is repeated on many new words to develop a general word-learning ability. We find that training models from scratch with Minnow on human-scale child-directed language enables strong few-shot word learning, comparable to a large language model (LLM) pre-trained on orders of magnitude more data. Furthermore, through discriminative and generative evaluations, we demonstrate that finetuning pre-trained LLMs with Minnow improves their ability to discriminate between new words, identify syntactic categories of new words, and generate reasonable new usages and definitions for new words, based on one or a few in-context examples. These findings highlight the data efficiency of Minnow and its potential to improve language model performance in word learning tasks.

Affiliations: Wentao Wang (New York University), Guangyuan Jiang (Peking University), Tal Linzen (New York University), Brenden M. Lake (New York University)
Emails: {ww2135, linzen, brenden}@nyu.edu, jgy@stu.pku.edu.cn

1 Introduction

Children can quickly learn a new word, or at least make meaningful inferences about its meaning, given only a few examples of its usage (Carey and Bartlett, 1978; Bloom, 2000).
For example, suppose a child who did not know the word ski hears the following mentions of the word (without visual examples): “Susie learned to ski last winter”, “People ski on tall mountains where there’s lots of snow”, and “I saw Susie ski fast down the snowy mountain.” From these usage examples, the child might infer that ski is a verb for a winter activity involving sliding down snowy mountains, and could begin understanding and using the word appropriately in new contexts. This ability to generalize and use a new word in novel contexts from just a few examples reflects children’s remarkable data efficiency in language learning, allowing them to quickly acquire vocabulary without requiring tens or hundreds of examples per word.

Compared to humans, current pre-trained language models are inefficient word learners, both in the total amount of pre-training data and the number of examples needed for each word. Even though large language models (LLMs) are typically pre-trained on four or five orders of magnitude more language input than any single human could receive (Linzen, 2020; Frank, 2023), they struggle with systematic generalizations of words that are rare or unseen in their training data (Wei et al., 2021; Razeghi et al., 2022; Kim et al., 2022; Batsuren et al., 2024; Land and Bartolo, 2024).

This contrast between human learning and language model training raises two long-term research questions: 1) Could language models develop a human-like ability for few-shot word learning without astronomical amounts of training data? 2) Could existing LLMs be adapted to improve their few-shot word learning abilities, allowing them to systematically and flexibly use new words in new contexts?

Here, we introduce a simple method, Meta-training for IN-context learNing Of Words (Minnow), to train or finetune a language model to develop an in-context few-shot word learning capability (see Figure 1 for an illustration of our method).
We adopt meta-training (i.e., meta-learning) since it has had successes in endowing neural networks with stronger systematic generalization, closely related to our objective of word learning (see Russin et al., 2024 for a review of the successes). Specifically, we use Meta-training for In-Context Learning (MetaICL; Min et al., 2022; Chen et al., 2022) to train from scratch or finetune an auto-regressive language model to generate new usages of a new word given a set of illustrations of the new word in its previous context. In-context learning (ICL) builds and uses contextual representations of the new word on the fly without parameter updates. MetaICL repeats ICL on many different new words and optimizes the model parameters for a general word-learning ability.

[Figure 1 (image not reproduced). Panel labels: “Meta-learning { aardvark: examples, ski: examples, … }”, “Language modeling { sentence1, sentence2, … }”, “Episodes (as extracted from corpus)”, “Episodes (as appear to the model)”, “Update model parameters with token prediction loss”. Example episode for the word ski as extracted from the corpus: study examples “Susie learned to ski last winter.”, “People ski on tall mountains where there's lots of snow.”, “I saw Susie ski fast down the snowy mountain.”, and generalization example “He will ski past the pine trees.” The same episode as it appears to the model: <sep> Susie learned to [new-token] last winter. <sep> People [new-token] on tall mountains where there’s lots of snow. <sep> I saw Susie [new-token] fast down the snowy mountain. <sep> He will [new-token] past the pine trees. <sep>]

Figure 1: Illustration of Minnow (top) and language modeling (bottom), which can be mixed together during training such that both contribute to model updates. Each meta-learning episode in Minnow aims to learn a new word from a set of study examples (sentences that use the word) in the context and then generate a generalization example that also uses the word. Each language modeling episode contains a set of unrelated sentences without meta-learned words. An episode will be converted into a single sequence in which we replace the word to be learned (if it is a meta-learning episode) with a special placeholder token (e.g., [new-token]) and concatenate/wrap the sentences with another special separator token (e.g., <sep>). We do gradient updates of the model parameters to optimize the next-token prediction loss on the sequence.

To demonstrate the data efficiency of our method, we train language models from scratch with Minnow using small datasets: a corpus of child-directed speech (CHILDES; MacWhinney, 1992) and a corpus approximating the word count a child encounters during language acquisition (BabyLM-10M; Warstadt et al., 2023). To foreshadow our results, we find that our method’s performance on few-shot classification of new words from these datasets approaches that of the pre-trained Llama-3 8B (Llama Team, Meta AI, 2024), which was trained on vastly more data. This highlights how this ability can be developed from human-scale child-input data rather than the orders-of-magnitude larger datasets typically used to train LLMs.

We also finetune Llama-3 8B with Minnow to see if we can enhance its word-learning ability.
In a series of discriminative and generative evaluations, we show that this improves Llama-3 8B’s ability to discriminate between new words, identify syntactic categories of new words, and generate reasonable new usages and definitions for new words, where each new word is learned from one or a few in-context examples. Most of these improvements are achieved without specific training on these evaluation tasks. We will release our code upon publication of our work.

2 Related Work

2.1 The Rare Word Problem

Word frequencies in natural corpora follow a highly skewed (Zipfian) distribution (Zipf, 1949), resulting in a heavy tail of rare words. Additionally, new words are constantly entering the language (Heaps, 1978). To represent all possible words, various word-form-based methods have been proposed, including subword- and character-based tokenizations and using morphological information (see Mielke et al., 2021 for a comprehensive survey). However, representing a word alone does not help in learning it from a few contexts in which it occurs. Models optimized for conventional language modeling still struggle with the usage of unfamiliar or completely novel words, tokens, or token sequences, where word-forms or token identities alone do not provide enough information (Ott et al., 2018; Schick and Schütze, 2020; Wei et al., 2021; Razeghi et al., 2022; Kim et al., 2022; Batsuren et al., 2024; Land and Bartolo, 2024). Instead of representing new words based on word-forms, we discard word-form information and use a dedicated special placeholder token that is the same for every new word. In this way, we aim to develop a general and efficient ability to learn a word from a few contexts of its usage.

2.2 Few-Shot Word Learning

Another line of previous work targets the problem of learning a new word from a few examples.
Most previous work aims to produce a representation for the new word, i.e., an embedding, that fits into the global word embedding space so it can be used in the same way as other learned words (Mikolov et al., 2013; Pennington et al., 2014). The embedding can be produced by aggregating the embeddings of the contexts that the new word appears in (Lazaridou et al., 2017; Khodak et al., 2018), finetuning the embedding within the context (Herbelot and Baroni, 2017; Lampinen and McClelland, 2017; Hewitt, 2021; Kim and Smolensky, 2021), or utilizing the word-form information (Luong et al., 2013; Schick and Schütze, 2019). More recent work uses Transformer layers to produce the embedding based on Word2Vec embeddings (Hu et al., 2019, HiCE), or by aggregating similar embeddings of word contexts from a memory system (Sun et al., 2018, Mem2Vec). Also related to our approach, Teehan et al.’s (2024) work uses a meta-learning framework named CoLLEGe to train a Transformer encoder to produce an embedding for a new word from its examples of usage. Our method also targets few-shot word learning, but is simpler than Teehan et al. (2024) in architecture and training and does not produce a separate embedding for each new word.

2.3 Meta-training for In-Context Learning

Building on LLMs’ in-context learning abilities (Brown et al., 2020), Meta-training for In-Context Learning (MetaICL) optimizes language models on multiple different tasks, each learned from a few in-context examples (Min et al., 2022; Chen et al., 2022).[1] A class of tasks that MetaICL (or similar curriculums) aim to learn and generalize requires inferring the context-dependent mapping from the symbols to meanings (Lake and Baroni, 2023; Huang et al., 2024; Anand et al., 2025; Park et al., 2025). We follow this work to use MetaICL for our word learning task, in which the mapping from a new word to its meaning should be inferred purely from its usage in the context.
[1] MetaICL is different from Coda-Forno et al. (2023), which uses in-context learning instead of parameter updates to learn from multiple tasks.

3 Method

The goal of our method, Minnow, is to enable a model to infer the meaning of a new word from a few examples of its usage so it can understand and generate novel usage examples of the word, coherently and systematically combining it with other words in new contexts. To achieve this, Minnow trains the model to generate another usage example of the new word, a task that, when sufficiently challenging, requires mastery of this ability. Minnow is a general framework that can be applied to both training a model from scratch and finetuning a pre-trained model. After describing the method, we introduce the training data we use, a held-out word classification task for model evaluation and hyperparameter tuning, and how we use the off-the-shelf Llama-3 8B as a baseline for our experiments.

3.1 Method: Minnow

Following the typical meta-learning approach, we construct episodes $\{T_i\}_{i=1}^{N}$; each $T_i$ consists of $K$ examples $\{x^{(i)}_k\}_{k=1}^{K}$ sampled in accordance with the desired task (Figure 1: top). In each episode, the model’s task is to learn a new word $w_i$; each example $x^{(i)}_k$ is a sentence illustrating how $w_i$ is used. We concatenate the examples $\{x^{(i)}_k\}_{k=1}^{K}$ into a single sequence, separated by a special separator token (<sep> when training from scratch or a reserved special token in the Llama-3 8B vocabulary when finetuning Llama-3 8B). The objective is next-token prediction on this concatenated sequence: we expect the model to predict a new usage example given the previous examples, i.e., $p(x^{(i)}_k \mid x^{(i)}_1, \ldots, x^{(i)}_{k-1})$. We replace (mask) all occurrences of $w_i$ in the sequence with a special placeholder token ([new-token] when training from scratch or a different reserved special token when finetuning Llama-3 8B). The same placeholder token for the new word is shared across all episodes, such that the model does not learn a new embedding each time. Using the ski example from Section 1, the sequence for training models from scratch would be

<sep> Susie learned to [new-token] last winter <sep> People [new-token] on tall mountains where there’s lots of snow <sep> I saw Susie [new-token] fast down the snowy mountain <sep>

Note that our setting differs from previous MetaICL settings (Min et al., 2022; Chen et al., 2022; Lake and Baroni, 2023) in two ways. First, each example is not an input–output pair $(x^{(i)}_k, y^{(i)}_k)$, but just $x^{(i)}_k$. Second, there is no explicit separation between study examples and a query: our setting effectively uses every example $x^{(i)}_k$ as a query with all previous examples $x^{(i)}_1, \ldots, x^{(i)}_{k-1}$ as its study examples.

When we train a model from scratch, we also provide episodes of language modeling (without masked new tokens) to further facilitate language learning, as illustrated in Figure 1 (bottom). Each of these episodes consists of the same number $K$ of randomly sampled unrelated sentences, without new words. We concatenate them in the same format and train the model to perform next-token prediction on the concatenated sequences. Training batches of language modeling episodes interleave with the batches of meta-learning episodes. The model can determine whether an episode is for meta-learning or language modeling from whether the special placeholder token occurs in the first sentence.

3.2 Data

To demonstrate the data efficiency of our method compared to humans, we use data sources that are close to children’s language input in quantity or quality (Warstadt et al., 2023). We construct one dataset from each of two corpora: CHILDES (MacWhinney, 1992) and BabyLM-10M (Warstadt et al., 2023). CHILDES is a corpus of transcriptions of child–caregiver speech interactions.
We use input to children (excluding utterances produced by children) in the North American English portion of CHILDES. BabyLM is an English dataset including child-directed speech as well as additional data sources, such as children’s books, transcriptions of dialogs between adults, and Wikipedia articles. We use the 10M word corpus constructed as part of the first BabyLM Challenge.

Each dataset consists of two disjoint components, one for meta-learning (the leftmost set in Figure 1: top) and the other for language modeling (the leftmost set in Figure 1: bottom). We select a set of lower-frequency words in the corpus to be meta-learned in the meta-learning component.[2] Each meta-learned word $w$ has a set of $n_w$ sentence examples illustrating its usage. We assign each sentence in the corpus to at most one meta-learned word, so the identity of the word masked by the placeholder token is not revealed in other meta-learning episodes. During each training epoch, the $n_w$ examples for each word $w$ are split into $\lfloor n_w / K \rfloor$ (non-overlapping) episodes of $K$ examples, such that more frequent words have more episodes. This way of sampling episodes preserves the original Zipfian distribution of the word frequencies. Examples in the episodes are shuffled for each training epoch. Other sentences in the corpus that have no meta-learned words are used for language modeling (Figure 1: bottom). We split both the meta-learning component (by word) and the language modeling component (by sentence) into training (80%), validation (10%) and test (10%) portions.

[2] Different word-forms of the same lexeme, like “ski,” “skis,” and “skiing,” are treated as different words in the dataset. See Appendix H for further discussion.

Each dataset is used for both training models from scratch and finetuning pre-trained Llama-3 8B, but the text is formatted and tokenized differently (in addition to the different special tokens in Section 3.1; see Appendix B for the differences).
We provide additional details about data preprocessing, sentence assignment, dataset splitting, and text formatting in Appendix A, with statistics of our datasets shown in Table 5. In the training portion, our CHILDES dataset contains 7,790 words to be meta-learned and has a total of 5.8M tokens, while our BabyLM-10M dataset contains 15,821 words to be meta-learned and has a total of 7.8M tokens. In comparison, a child receives roughly 3M to 12M words per year (Frank, 2023), and thus our training data is of a similar magnitude to a year’s worth of linguistic input for a child.

3.3 Held-out Word Classification

We introduce a word classification task, in which we measure the model’s ability to discriminate the identities of new words that were never seen during training (i.e., held-out), based on in-context study examples. Validation accuracy on this task is used to tune training hyperparameters (e.g., learning rate; described later).

Given a query example sentence $q$ that uses a new word and a set of $C$ candidate words $\{w^{(c)}\}_{c=1}^{C}$, the task for the model is to match the query example to the most suitable one among the $C$ candidate words. Each $w^{(c)}$ is represented by a context containing a set of $K-1$ study examples $\{x^{(c)}_k\}_{k=1}^{K-1}$ illustrating its usage. The context of $w^{(c)}$ is a sequence in the same format as the first $K-1$ examples in a training episode, ending with a separator token (e.g., <sep>): <sep> $x^{(c)}_1$ <sep> $\cdots$ <sep> $x^{(c)}_{K-1}$ <sep>. The query example is formatted as a continuation sequence of the context: $q$ <sep>. This formatting ensures that concatenating a context sequence and a query sequence results in a sequence with $K$ examples, just like a sequence for a meta-learning training episode. To determine the best match, we compute the conditional likelihood of the query sequence given the context: $p_{LM}(q \mid x^{(c)}_1, \ldots, x^{(c)}_{K-1})$. The model predicts the word corresponding to the context with the highest likelihood: $\arg\max_c \, p_{LM}(q \mid x^{(c)}_1, \ldots, x^{(c)}_{K-1})$.
The prediction is correct if it is the ground-truth word in the query $q$.

We evaluate each model (trained from scratch or finetuned) by measuring the classification accuracy on held-out meta-learned words from the validation or test portions of the model’s training or finetuning corpus. For each evaluation, we group $C$ distinct meta-learned words into a $C$-way classification task. For each word, we sample $K-1$ study examples and one query example to construct the task. See Appendix C for additional details on task construction.

3.4 Baseline: Off-the-shelf Llama-3 8B

For training models from scratch, we need an LLM that is pre-trained on massive data with conventional language modeling for data-efficiency comparison. To determine the effectiveness of finetuning an LLM, we need to evaluate its baseline word-learning ability. To address both needs, we use the off-the-shelf Llama-3 8B model as a baseline for word-learning tasks. We experiment with both the pre-trained and the instruction-tuned variants of the model. We primarily report baseline results from the pre-trained variant, and present results from the instruction-tuned variant only in the generative settings, where its performance may differ significantly from that of the pre-trained one. For evaluation, we present a meta-learning episode to Llama-3 8B in a text format similar to the training or finetuning sequences (Section 3.1), but designed to be more natural and closer to its pre-training data.
In particular, we use a pseudo-word (e.g., “dax”) as the placeholder for the new word, with a newline character and a star “\n * ” serving as the separator between examples, effectively formatting the examples as a list.[3] Using the ski example in Section 1 again, the formatted text appears as follows:

 * Susie learned to dax last winter
 * People dax on tall mountains where there’s lots of snow
 * I saw Susie dax fast down the snowy mountain
 *

The “\n * ” at the end serves as the last separator, like the last <sep> in the example sequence in Section 3.1.

[3] We choose the pseudo-word to be meaningless. However, a pre-trained LLM may ascribe a meaning to the pseudo-word based on its form. We acknowledge that replacing a word in an example with a pseudo-word could mislead the LLM and weaken the baseline. See Appendix H for detailed discussion.

4 Training Models From Scratch

In this section, we investigate whether models can develop the ability of few-shot word learning from human-scale input. We use the GPT-NeoX transformer architecture (Andonian et al., 2023) with configurations modified from Pythia-160M (Biderman et al., 2023).[4]

[4] We use an architecture with modern features such as relative positional encoding which may help in extrapolation to longer sequences and more examples. See Appendix B for details of our modifications.

We use word-level tokenization. We exclude words with a frequency less than five from the vocabulary and replace them with <unk> tokens. We likewise remove the words that are to be meta-learned from this vocabulary and replace all of their occurrences in sentences other than their meta-learning episodes with <unk>. As mentioned in Section 3.1, the vocabulary also includes two special tokens: the placeholder token [new-token] and the separator token <sep>.
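As a rough illustration of how the episode format of Section 3.1, the episode splitting of Section 3.2, and the word-level vocabulary above fit together, the following Python sketch builds a vocabulary, splits one word's examples into $\lfloor n_w / K \rfloor$ episodes, and formats each episode as a token sequence. Whitespace tokenization, the helper names (`build_vocab`, `split_into_episodes`, `format_episode`), and the toy sentences are assumptions made for illustration only; this is not the authors' code.

```python
import random
from collections import Counter

SEP = "<sep>"          # separator token (Section 3.1)
NEW = "[new-token]"    # shared placeholder for every meta-learned word
UNK = "<unk>"          # replacement for out-of-vocabulary words


def build_vocab(sentences, meta_words, min_freq=5):
    """Keep words seen at least `min_freq` times, minus meta-learned words.

    Assumption: sentences are already lowercased and whitespace-tokenizable.
    """
    counts = Counter(tok for s in sentences for tok in s.split())
    vocab = {w for w, c in counts.items() if c >= min_freq and w not in meta_words}
    return vocab | {SEP, NEW, UNK}


def split_into_episodes(examples, k, rng):
    """Shuffle a word's examples and cut them into floor(n_w / k) disjoint episodes."""
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_episodes = len(shuffled) // k
    return [shuffled[i * k:(i + 1) * k] for i in range(n_episodes)]


def format_episode(examples, word, vocab):
    """Turn one episode into a single token sequence wrapped in separators.

    `word` is the word to be learned; pass None for a language modeling episode.
    """
    tokens = [SEP]
    for sentence in examples:
        for tok in sentence.split():
            if word is not None and tok == word:
                tokens.append(NEW)      # mask the word being learned
            elif tok in vocab:
                tokens.append(tok)
            else:
                tokens.append(UNK)      # rare words and other meta-learned words
        tokens.append(SEP)
    return tokens


# Toy usage with the ski example; min_freq=1 only because the toy corpus is tiny.
rng = random.Random(0)
ski_examples = [
    "susie learned to ski last winter",
    "people ski on tall mountains where there's lots of snow",
    "i saw susie ski fast down the snowy mountain",
]
toy_vocab = build_vocab(ski_examples, meta_words={"ski"}, min_freq=1)
for episode in split_into_episodes(ski_examples, k=3, rng=rng):
    print(" ".join(format_episode(episode, "ski", toy_vocab)))
```

Because the meta-learned word is removed from the vocabulary, any occurrence outside its own episodes falls through to the `<unk>` branch, which mirrors the replacement rule described above.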
On each of the two datasets (CHILDES and BabyLM-10M) we train three models from scratch (i.e., the models are randomly initialized), each with $K = 5$ examples per episode and a different random seed. In each of the three runs, we choose the checkpoint with the lowest validation loss on the meta-learning objective. Using one random seed, we fix the batch size and tune other training hyperparameters, including the learning rate and weight decay, for the best 4-way ($C = 4$) held-out word classification accuracy on the validation portion of the dataset (the task was introduced in Section 3.3). We then apply the same training hyperparameters to the other seeds. See Appendix B for detailed architecture configurations and training hyperparameters including batch size, learning rate (with scheduling), and weight decay. In the following, we report mean accuracies of models across the three runs on the test portion of the dataset they were trained on.

Results. Models trained from scratch on $K = 5$ examples per episode sampled from CHILDES and BabyLM-10M achieve test accuracies of 72% and 77%, respectively, on the 4-way ($C = 4$) classification task. These results are substantially higher than random chance (25%) and close to the 71% and 78% accuracies achieved by the Llama-3 8B baseline, which was pre-trained on orders of magnitude more data. We provide results in additional settings, including experiments with $K = 10$ examples on CHILDES and 8-way ($C = 8$) classification, in Appendix C, Table 6. Across all settings, models trained from scratch consistently achieve accuracies well above chance and within a 3% margin of the Llama-3 8B baseline. These findings (on CHILDES in particular) demonstrate that few-shot word learning can be effectively acquired using our method, even with human-scale child-input data.

5 Finetuning Pre-trained LLMs

In this section, we test if our method can improve pre-trained LLMs’ in-context few-shot word learning abilities.
We finetune Llama-3 8B with Minnow three times on the meta-learning component of BabyLM-10M, each run with $K = 5$ examples per episode and a different random seed.[5] We refer to the models finetuned with Minnow as Minnow models. We do not include the language modeling components since the LLM already learned a large vocabulary and is capable of language modeling. We finetune from both the pre-trained and instruction-tuned variants of Llama-3 8B, but we refer to the models finetuned from the pre-trained variant by default, same as for the baseline (Section 3.4). We freeze all of the model’s parameters except the input and output embeddings of the two special tokens (the placeholder and the separator; Section 3.1). We initialize the embeddings of these two special tokens as the mean of all other input/output embeddings (Hewitt, 2021). We select the checkpoint for each run and tune the learning rate in the same way as when training from scratch, except that we do not apply weight decay (Section 4). See Appendix B for more details on text formatting, tokenization, and training hyperparameters including batch size and learning rate (with scheduling). In the following, we evaluate the Minnow models and baselines on a series of tasks.

[5] We focus on finetuning models on BabyLM-10M in this section, since it is more diversified and usually yields better results than CHILDES.

5.1 Held-out Word Classification

We first evaluate models on the held-out word classification task (Section 3.3). Finetuning Llama-3 8B with Minnow boosts the test 4-way ($C = 4$) classification accuracy from the baseline level of 78% to 87% on BabyLM-10M (and from 71% to 79% on CHILDES). We provide results for additional values of $K$ and $C$ in Appendix C, Table 6; broadly, across all settings, the Minnow model improves test accuracy by 8–10 percentage points over the Llama-3 8B baseline. These findings show that Minnow finetuning effectively improves the pre-trained LLM’s in-context few-shot word learning ability.
Despite these strong results, this task does not assess more fine-grained aspects of meaning that may not be apparent from discriminating an arbitrary set of words, and the semantic coherence of the usage contexts could be a shortcut utilized by the model (see Appendix C for further discussion). To address this, we provide the next analysis focusing on the syntactic categories of words.

5.2 Syntactic Category Classification

In this evaluation, we test if models can differentiate words in different syntactic categories, a crucial feature for systematic generalization. We follow the classification paradigm introduced in Section 3.3. We use the methodology of Kim and Smolensky (2021) as well as the dataset they constructed from MNLI, a Natural Language Inference dataset (Williams et al., 2017). The dataset focuses on four syntactic categories (noun, verb, adjective, and adverb) and tests the ability to differentiate each pair of categories. See Appendix D for details of the dataset.

In each instance of the classification task, we learn two new words $w^{(1)}$ and $w^{(2)}$ in different syntactic categories; the syntactic category of each new word $w^{(i)}$ is unambiguously signaled by a study example $x^{(i)}$ (replacing the word with the placeholder, e.g., [new-token]). For example, say $w^{(1)}$ is a noun and $w^{(2)}$ is a verb:

(1) A [new-token] needs two people. (for $w^{(1)}$)
(2) She [new-token] at the group. (for $w^{(2)}$)

We test our models on query examples that use a word in one of the two categories, as in the following examples:

(1) Keep everyone else company by sitting in the [new-token]. (expecting $w^{(1)}$)
(2) The colonel [new-token] us to a hotel. (expecting $w^{(2)}$)

Note that, unlike the previous task, query examples are semantically unrelated to the study examples in this task, thus excluding the shortcut of semantic coherence. Below, we report the mean accuracies across the three runs.
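The classification paradigm from Section 3.3, reused in this evaluation, amounts to an argmax over context-conditioned likelihoods. A minimal Python sketch of that decision rule follows. The function `toy_log_likelihood` is a made-up stand-in (an add-one-smoothed unigram model estimated from the context) for a real language model's $\log p_{LM}(q \mid x^{(c)}_1, \ldots, x^{(c)}_{K-1})$, and all names and sentences here are hypothetical.

```python
import math
from collections import Counter


def toy_log_likelihood(query_tokens, context_tokens):
    """Hypothetical stand-in for log p_LM(query | context).

    Scores the query under an add-one-smoothed unigram model estimated
    from the context tokens. A real evaluation would instead sum the
    language model's next-token log-probabilities over the query.
    """
    counts = Counter(context_tokens)
    vocab = set(context_tokens) | set(query_tokens)
    total = len(context_tokens) + len(vocab)
    return sum(math.log((counts[tok] + 1) / total) for tok in query_tokens)


def classify(query_tokens, contexts, score=toy_log_likelihood):
    """Return the index c of the context with the highest score for the query."""
    scores = [score(query_tokens, ctx) for ctx in contexts]
    return max(range(len(contexts)), key=scores.__getitem__)


# Two candidate words, each represented by a context of study examples.
aardvark_ctx = ("<sep> look there's an [new-token] it's like an anteater "
                "<sep> see the [new-token] has a long snout for eating bugs <sep>").split()
ski_ctx = ("<sep> susie learned to [new-token] last winter "
           "<sep> people [new-token] on tall mountains where there's lots of snow <sep>").split()
query = "susie will [new-token] on tall mountains <sep>".split()

predicted = classify(query, [aardvark_ctx, ski_ctx])
print(predicted)  # index of the best-matching context
```

The toy scorer ignores word order and can only reward lexical overlap between query and context, which is exactly the semantic-coherence shortcut that the syntactic category task is designed to rule out; passing a real model's log-likelihood as `score` is what the evaluation actually requires.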
Results. We first find that the Llama-3 8B baseline achieves 64% accuracy on this task, which is higher than random chance (50%), suggesting that it can infer the syntactic categories of new words in one shot and generalize them to novel contexts. The Minnow model improves accuracy to 83%, an increase of 19 percentage points over the baseline. Fine-grained results from models finetuned with Minnow (and trained from scratch) are provided in Appendix D. We find in all settings that the Minnow model improves accuracy by 11–26% compared to the baseline on all pairs of categories. These improvements show that Minnow finetuning effectively helps in learning the syntactic categories of new words and generalizing accordingly. In addition, note that our models are not specifically finetuned on this syntactic category classification task and dataset, demonstrating the generality of the acquired word learning ability.

5.3 New Usage Example Generation

The two tests we have described so far evaluate models in a discriminative setting. Here, we quantitatively and qualitatively evaluate if models use the new word appropriately in a generative setting. For a Minnow model finetuned with $K$ examples per episode, we evaluate it by showing it $K-1$ in-context study examples, formatted as a sequence in the classification setting (Section 3.3). We ask the model to do what it was trained for: We prompt the model with this sequence of study examples, and because the sequence ends with a separator token, the model will continue the sequence by generating a new usage example, ending with another separator token as End-Of-Sequence.

We sample study examples from two datasets: the BabyLM-10M test portion in Section 3.2 and the Chimera dataset (Lazaridou et al., 2017). The Chimera dataset was specifically constructed for few-shot word learning.
It has 33 different new words for learning, each referring to a “chimera” concept, i.e., a mixture of two existing and related concepts (e.g., a cello and a bagpipe). The usage examples of a new word are sentences using one of the components of the chimera, randomly extracted from a large corpus. See Appendix F for additional details of the dataset and our preprocessing.

                                  New Usage Example              Definition
Variant            Method         BabyLM-10M test    Chimera     CoLLEGe-DefGen
pre-trained        baseline       32                 42          29
                   +Minnow        52                 55          39
instruction-tuned  baseline       41                 52          33
                   +Minnow        47                 36          37

Table 1: Percentages of wins of each model when comparing the generations from the Llama-3 8B baseline (pre-trained or instruction-tuned) with those of a Minnow model finetuned from that baseline, judged by GPT-4o. The left two datasets are for new usage example generation in Section 5.3, and the right-most one is for definition generation in Section 5.4. Each new example or definition is generated by greedy decoding. (Results of top-p sampled generations are shown in Table 9 in Appendix E.) The percentage of ties is the remainder after subtracting the win percentages of the two models. GPT-4o more frequently chooses the Minnow model as the winner compared to the corresponding baseline in all settings except for the instruction-tuned variant on Chimera.

For the quantitative evaluation, we compare a pair of new usage examples generated from the Llama-3 8B baseline and a Minnow model finetuned from it. The comparison is simulated as a head-to-head competition following the methodology in the definition generation section of Teehan et al. (2024).
Specifically, we provide GPT-4o (OpenAI, 2024) the same K−1 study examples in a list format with a pseudo-word “dax” as the placeholder for the word, as in the baseline (without the last separator; Section 3.4), followed by a question “Which of the following is a better next example for the word ‘dax’, or they tie?” with three shuffled options, including the two generations and one “Tie”. (See Appendix E for detailed settings of prompting.) GPT-4o's choice determines which model wins the comparison, or whether the two models are tied in quality.

For the qualitative evaluation, we manually pick meta-learned words (shown in Table 2 and Tables 10, 11, and 12 in Appendix F) and examine the syntactic correctness and semantic appropriateness of the generated examples.

Results. For the quantitative evaluation, Table 1 shows the win percentages of the baseline and the Minnow model on both the BabyLM-10M test portion and Chimera. Across all settings, the Minnow model wins more often than the corresponding baseline except for the instruction-tuned variant on Chimera, demonstrating the improvement brought by Minnow. For the qualitative evaluation, Table 2 shows a word picked from the BabyLM-10M test portion along with its study and generated examples. See Appendix F for additional examples from the BabyLM-10M test portion (Tables 10 and 12) and Chimera (Table 11) and detailed analysis of both the baseline and the Minnow model’s generations. A manual analysis of these generated examples reveals that the Minnow model more often generates syntactically correct and semantically plausible new usage examples compared to the baseline, confirming that Minnow finetuning improves the ability to understand and use a new word.
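
The head-to-head protocol above can be sketched as follows. This is a minimal illustration, not the authors' prompting code: the exact prompt layout is given in Appendix E and is not reproduced here, so the list format, the option letters, and the function names `build_judge_prompt` and `win_percentages` are all assumptions. The sketch shows the two parts the text specifies: shuffling the three options (two generations plus "Tie") and computing win percentages with ties as the remainder, as in Table 1.

```python
import random
from collections import Counter

# Question wording from Section 5.3; the surrounding layout is an assumption.
QUESTION = ("Which of the following is a better next example for the word "
            "'dax', or they tie?")


def build_judge_prompt(study_examples, gen_baseline, gen_minnow, rng):
    """Return (prompt, owner), where owner maps an option letter to
    'baseline', 'minnow', or 'tie' so the judge's answer can be decoded."""
    options = [("baseline", gen_baseline), ("minnow", gen_minnow), ("tie", "Tie")]
    rng.shuffle(options)  # hide which model produced which option
    lines = [f"- {example}" for example in study_examples]
    lines.append(QUESTION)
    owner = {}
    for letter, (name, text) in zip("ABC", options):
        lines.append(f"{letter}. {text}")
        owner[letter] = name
    return "\n".join(lines), owner


def win_percentages(outcomes):
    """outcomes: list of 'baseline' / 'minnow' / 'tie', one per comparison.
    Ties are reported as the remainder after the two win percentages."""
    counts = Counter(outcomes)
    total = len(outcomes)
    baseline = 100.0 * counts["baseline"] / total
    minnow = 100.0 * counts["minnow"] / total
    return {"baseline": baseline, "minnow": minnow,
            "tie": 100.0 - baseline - minnow}
```

In use, the judge's chosen letter would be looked up in `owner` and appended to the outcome list; passing a seeded `random.Random` keeps the option order reproducible across runs.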
Nevertheless, in several cases, the Minnow model still shows obvious syntactic and factual errors and merely rewords the study examples.

Word: hittites

Study Example Sentences:
• the first blacksmiths were [new-token].
• many civilisations were in the area that is now turkey, like the [new-token], the roman empire and the byzantine empire.
• spread of hepatoscopy and astrology to [new-token], etruscans, greeks and romans and to china
• the first major empire in the area was the [new-token] (from the 18th century to the 13th century bce).

Minnow Generated Examples:
1. the [new-token] were a people who lived in the area of turkey.
2. perhaps the most famous and widely used alchemical symbol, first popularized by [new-token] alchemists, is the ouroboros.

Table 2: New examples generated for a word from the BabyLM-10M test portion by the Minnow model. The first one is generated by greedy decoding, and the second one by sampling with top-p = 0.92. The Minnow model learns that hittites is an ancient ethnic group. However, the greedy-decoded example copies the information (turkey) from the study example, while the sampled example makes seemingly plausible but factually incorrect generalizations (the earliest known ouroboros is found in an ancient Egyptian text).

5.4 Definition Generation

To further probe how well Minnow finetuning helps the model understand a new word, we prompt each model to generate a definition for the word given one or a few usage examples. We again use the methodology of Teehan et al. (2024) for definition generation and evaluation, as well as the two evaluation datasets they used: CoLLEGe-DefGen, which they created, and the Oxford dataset (Gadetsky et al., 2018). CoLLEGe-DefGen was constructed by selecting 954 words from WordNet (Miller, 1995) and prompting GPT-4 (OpenAI, 2023) to generate one definition and five usage examples for each word.
The model generates a definition from one, two, or three usage examples sampled for each word in this dataset (i.e., in 1-, 2-, or 3-shot settings). The Oxford test set consists of 12,232 words, each with a definition and a usage example collected from the Oxford Dictionary. The model generates a definition from the only usage example for each word in this dataset (i.e., in a 1-shot setting). To generate a definition, we prompt the model with the sequence of the usage example(s) (as in Section 5.3) followed by “The word [new-token] in the above sentence(s) is defined as "”⁶ ([new-token] is instead the placeholder token or pseudoword, as appropriate). For additional comparisons with models expected to do especially well on this task, we also evaluate specialized definition-generation models: Giulianelli et al.’s (2023) FLAN-T5 models (Chung et al., 2024). See Appendix G for details of data preprocessing and the specialized models.

For the quantitative evaluation, we perform two types of comparison. The first type compares the model-generated and ground-truth definitions for each word by computing BERTScore F1 (Zhang et al., 2020) and ROUGE-L (Lin, 2004). The second type compares a pair of definitions generated from the Llama-3 8B baseline and a Minnow model finetuned from it. Similarly to what we did in Section 5.3, we ask GPT-4o a question (without usage examples): “Which of the following is a better definition for the word ‘Word’, or they tie?” where Word is the ground-truth word form, followed by three shuffled options including the two generated definitions and one “Tie” (see Appendix E for detailed prompting settings).⁷ For the qualitative evaluation, we manually inspect 1-shot generated definitions for words from each dataset (presented in Table 4 and Tables 15 and 16 in Appendix G).

⁶ The prompt ends with a double quotation mark ("), so that the model will continue with a definition ending at another double quotation mark. This makes extracting definitions easy.

Results. For the quantitative evaluation, we first present the 1-shot scores of comparing the model-generated and ground-truth definitions for Llama-3 8B baselines and the Minnow models in Table 3. In Appendix G, we present 1-shot scores for all models (including the specialized FLAN-T5 models and the original CoLLEGe model) in Table 13 and averaged 1-, 2-, and 3-shot results on CoLLEGe-DefGen in Table 14. In all of these settings, Minnow finetuning improves the baseline scores by 0.3–1.5 on BERTScore F1 and 3.1–5.3 on ROUGE-L. On CoLLEGe-DefGen, the Minnow model finetuned from the instruction-tuned Llama-3 8B outperforms all other models across all settings. On Oxford, the Minnow models finetuned from both variants perform comparably well, but they are inferior to the largest specialized FLAN-T5 by 2.9 on ROUGE-L. However, note that our Minnow finetuning is neither tailored for generating definitions nor using these specific definition datasets. In Table 1, when comparing the definitions generated from each baseline and a Minnow model finetuned from that baseline, the latter is more often favored over the corresponding baselines for both Llama-3 8B variants.

For the qualitative evaluation, Table 4 shows Minnow-model-generated and ground-truth definitions for words from CoLLEGe-DefGen (see Tables 15 and 16 in Appendix G for additional examples from CoLLEGe-DefGen and Oxford).
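
The definition pipeline described in this section (prompt ending in an open quotation mark, extraction up to the closing quotation mark, then overlap scoring against the ground truth) can be sketched as below. This is an illustrative sketch under assumptions: joining usage examples with newlines stands in for the separator-based sequence format of Section 5.3, and `rouge_l_f1` is a simplified whitespace-token, LCS-based F1 rather than the reference ROUGE implementation, so its values (on a 0–1 scale) will not match the paper's reported scores exactly. All function names are hypothetical.

```python
def build_definition_prompt(usage_examples, word_token="[new-token]"):
    """Usage examples followed by the definition cue, ending in an open quote.
    Newline joining is an assumption; the paper uses its own sequence format."""
    context = "\n".join(usage_examples)
    return f'{context}\nThe word {word_token} in the above sentence(s) is defined as "'


def extract_definition(continuation):
    """Keep the generated text up to the first closing double quotation mark."""
    return continuation.split('"', 1)[0].strip()


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall over
    lowercased whitespace tokens (no stemming)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Stopping at the first closing quotation mark is what makes the open-quote prompt convenient: anything the model generates after the definition is discarded without further parsing.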
To summarize our manual analysis, we find that definitions generated by the Minnow model often capture most of the word meanings, form reasonable inferences from the contexts, and outperform the baseline. However, they are not always precise compared to the ground-truth definitions.

⁷ We only perform this comparison on the CoLLEGe-DefGen dataset due to the large scale of the Oxford dataset.

| Variant | Method | CoLLEGe-DefGen BERTScore F1 | CoLLEGe-DefGen ROUGE-L | Oxford BERTScore F1 | Oxford ROUGE-L |
|---|---|---|---|---|---|
| pre-trained | baseline | 85.1 | 14.9 | 83.2 | 11.0 |
| pre-trained | +Minnow | 85.4 | 18.7 | 84.7 | 16.3 |
| instruction-tuned | baseline | 85.3 | 17.6 | 83.6 | 12.5 |
| instruction-tuned | +Minnow | 85.8 | 20.7 | 84.7 | 16.5 |

Table 3: Quantitative evaluation of generated definitions by comparing them with ground-truth definitions. See Table 13 in Appendix G for results from all models. We generate a definition from only one example (1-shot). All definitions are generated with greedy decoding. Scores of Minnow models are averaged across three runs. Finetuning with Minnow improves the baseline models on both datasets and both metrics, and the Minnow model finetuned from the instruction-tuned variant of Llama-3 8B performs the best.

| Example Sentence | Minnow Definition | True Definition | Word |
|---|---|---|---|
| After his thorough inspection of the antique pocket watch, the bespectacled collector sighed, claiming it was a [new-token], much to the seller’s disappointment. | a thing that is not genuine or authentic | a deception or trick | swiz |
| Despite his greed, the businessman felt bound by a [new-token] to maintain ethical practices. | a promise or agreement to do something | a moral obligation or command that is unconditionally and universally binding | categorical imperative |

Table 4: Definitions for two words from CoLLEGe-DefGen generated by the Minnow model finetuned from instruction-tuned Llama-3 8B with greedy decoding. Each definition is generated using the single example sentence shown and provided in context. The generated definitions managed to infer the core semantic features from the examples, though they are not precise enough compared to the true definitions. In the first example, the Minnow definition for the word “swiz” captures the word’s core meaning of fakeness, which is a reasonable inference from the example, but misses the intentional aspect, a nuance of the true definition. In the second example, the Minnow definition for “categorical imperative” captures the core meaning of obligation, which is a reasonable contrast to the businessman’s greed, but misses the “unconditionally and universally binding” aspect in the true definition.

6 Conclusion

In this work, we present Minnow, a new method to improve language models’ capability to learn a new word from a few in-context usage examples. Minnow successfully induced this ability in models trained from scratch with human-scale linguistic data, as indicated by their performance in differentiating new words (Section 4). Minnow finetuning further improved the word learning performance of a pre-trained LLM (Llama-3 8B), as demonstrated by its improvements in differentiating new words (Sections 5.1 and 5.2) as well as in generating new usage examples (Section 5.3) and definitions (Section 5.4) for the learned new words. In summary, this word-learning capability enables models to systematically and flexibly understand and use a new word in novel contexts, and can be immediately transferred to other words and tasks without additional training.

The efficacy of Minnow, or meta-learning in general, suggests that human-level efficiency in linguistic generalizations may be acquired through practicing over many instances of learning tasks, without presuming strict, explicit inductive biases (Russin et al., 2024; Irie and Lake, 2024).
Whether models achieve the generalizations in this work through human-like mechanisms, such as systematicity and categorical abstraction, remains for future analysis.

7 Limitations

Learning Settings. In this work, we consider word learning only in the text modality, in which the language model learns the meaning from the distribution of words. However, many words have real-world references, which usually accompany human word learning. We also use aggregated data from multiple sources, not from single-human/child input. Thus, a multimodal, grounded setting of word learning using a single agent’s input would be more realistic.

In addition, we only consider learning a single new word on the fly. However, in real-world learning, both humans and models need to continually learn multiple words, usages, and even abstract rules (Mueller et al., 2024). Implementing this continual learning setting would be another future direction.

Novelty of New Words When Testing LLMs. When testing LLMs (Section 5), the words and example sentences we use may already exist in the pre-training data, potentially allowing LLMs to recall known word meanings rather than learn genuinely new ones (note, however, the Chimera dataset introduces new concepts which are unusual and not lexicalized). The performance of the baseline LLMs shows that, even with this potential worry, there is room for improvement, which the Minnow-finetuned LLMs are able to achieve.

Models trained from scratch with Minnow do not have this limitation. Their training data explicitly excludes held-out test words (Section 4). Therefore, their test performance reflects their genuine ability to learn novel words, and this ability can be developed by Minnow.

Acknowledgements

We thank Michael Hu, Will Merrill, Sophie Hao, Byung-Doh Oh, Shauli Ravfogel, and other members of the Computation and Psycholinguistics Lab for insightful and helpful discussions and comments.
This work is supported by the National Science Foundation under NSF Award 1922658 (for Wentao Wang) and IIS-2239862. This work is also supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.

References

Suraj Anand, Michael A Lepori, Jack Merullo, and Ellie Pavlick. 2025. Dual process learning: Controlling use of in-context vs. in-weights strategies with weight forgetting. In The Thirteenth International Conference on Learning Representations.
Alex Andonian, Quentin Anthony, Stella Biderman, Sid Black, Preetham Gali, Leo Gao, Eric Hallahan, Josh Levy-Kramer, Connor Leahy, Lucas Nestler, Kip Parker, Michael Pieler, Jason Phang, Shivanshu Purohit, Hailey Schoelkopf, Dashiell Stander, Tri Songz, Curt Tigges, Benjamin Thérien, Phil Wang, and Samuel Weinbach. 2023. GPT-NeoX: Large scale autoregressive language modeling in pytorch.
Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, and Gábor Bella. 2024. Evaluating subword tokenization: Alien subword composition and oov generalization challenge. arXiv preprint arXiv:2404.13292.
Jean Berko. 1958. The child’s learning of english morphology. WORD, 14:150–177.
Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 2023. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR.
P Bloom. 2000. How Children Learn the Meanings of Words. MIT Press, Cambridge, MA.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
Susan Carey and Elsa Bartlett. 1978. Acquiring a single new word. Papers and Reports on Child Language Development, 15:17–29.
Yanda Chen, Ruiqi Zhong, Sheng Zha, George Karypis, and He He. 2022. Meta-learning via language model in-context tuning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 719–730, Dublin, Ireland. Association for Computational Linguistics.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53.
Julian Coda-Forno, Marcel Binz, Zeynep Akata, Matt Botvinick, Jane Wang, and Eric Schulz. 2023. Meta-in-context learning in large language models. In Advances in Neural Information Processing Systems, volume 36, pages 65189–65201. Curran Associates, Inc.
Michael C. Frank. 2023. Bridging the data gap between children and large language models. Trends in cognitive sciences.
Artyom Gadetsky, Ilya Yakubovskiy, and Dmitry Vetrov. 2018. Conditional generators of words definitions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 266–271, Melbourne, Australia. Association for Computational Linguistics.
Mario Giulianelli, Iris Luden, Raquel Fernandez, and Andrey Kutuzov. 2023. Interpretable word sense representations via definition generation: The case of semantic change analysis. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3130–3148, Toronto, Canada. Association for Computational Linguistics.
Harold Stanley Heaps. 1978. Information retrieval: computational and theoretical aspects. Academic Press, Inc.
Aurélie Herbelot and Marco Baroni. 2017. High-risk learning: acquiring new word vectors from tiny data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 304–309, Copenhagen, Denmark. Association for Computational Linguistics.
John Hewitt. 2021. Initializing new word embeddings for pretrained language models.
Ziniu Hu, Ting Chen, Kai-Wei Chang, and Yizhou Sun. 2019. Few-shot representation learning for out-of-vocabulary words. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4102–4112, Florence, Italy. Association for Computational Linguistics.
Qian Huang, Eric Zelikman, Sarah Chen, Yuhuai Wu, Gregory Valiant, and Percy S Liang. 2024. Lexinvariant language models. Advances in Neural Information Processing Systems, 36.
Kazuki Irie and Brenden M. Lake. 2024. Neural networks that overcome classic challenges through practice. ArXiv, abs/2410.10596.
Mikhail Khodak, Nikunj Saunshi, Yingyu Liang, Tengyu Ma, Brandon Stewart, and Sanjeev Arora. 2018. A la carte embedding: Cheap but effective induction of semantic feature vectors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22, Melbourne, Australia. Association for Computational Linguistics.
Najoung Kim, Tal Linzen, and Paul Smolensky. 2022. Uncontrolled lexical exposure leads to overestimation of compositional generalization in pretrained models. arXiv preprint arXiv:2212.10769.
Najoung Kim and Paul Smolensky. 2021. Testing for grammatical category abstraction in neural language models. In Proceedings of the Society for Computation in Linguistics 2021, pages 467–470, Online. Association for Computational Linguistics.
Brenden M. Lake and Marco Baroni. 2023. Human-like systematic generalization through a meta-learning neural network. Nature, 623:115–121.
Andrew K Lampinen and James L McClelland. 2017. One-shot and few-shot learning of word embeddings. arXiv preprint arXiv:1710.10280.
Sander Land and Max Bartolo. 2024. Fishing for magikarp: Automatically detecting under-trained tokens in large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11631–11646, Miami, Florida, USA. Association for Computational Linguistics.
Angeliki Lazaridou, Marco Marelli, and Marco Baroni. 2017. Multimodal word meaning induction from minimal exposure to natural text. Cognitive science, 41 Suppl 4:677–705.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
Tal Linzen. 2020. How can we accelerate progress towards human-like linguistic generalization? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5210–5217, Online. Association for Computational Linguistics.
Llama Team, Meta AI. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. 7th International Conference on Learning Representations, ICLR 2019.
Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113, Sofia, Bulgaria. Association for Computational Linguistics.
Brian MacWhinney. 1992. The CHILDES project: tools for analyzing talk. Child Language Teaching and Therapy, 8:217–218.
Sabrina J Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y Lee, Benoît Sagot, et al. 2021. Between words and characters: A brief history of open-vocabulary modeling and tokenization in nlp. arXiv preprint arXiv:2112.10508.
Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In International Conference on Learning Representations.
George A. Miller. 1995. WordNet: A lexical database for english. Commun. ACM, 38:39–41.
Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2022. MetaICL: Learning to learn in context. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2791–2809, Seattle, United States. Association for Computational Linguistics.
Aaron Mueller, Albert Webson, Jackson Petty, and Tal Linzen. 2024. In-context learning generalizes, but not always robustly: The case of syntax. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4761–4779, Mexico City, Mexico. Association for Computational Linguistics.
OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
OpenAI. 2024. GPT-4o system card. arXiv preprint arXiv:2410.21276.
Myle Ott, Michael Auli, David Grangier, and Marc’Aurelio Ranzato. 2018. Analyzing uncertainty in neural machine translation. In International Conference on Machine Learning.
Core Francisco Park, Andrew Lee, Ekdeep Singh Lubana, Yongyi Yang, Maya Okawa, Kento Nishi, Martin Wattenberg, and Hidenori Tanaka. 2025. ICLR: In-context learning of representations. In The Thirteenth International Conference on Learning Representations.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing.
Yasaman Razeghi, Robert L Logan IV, Matt Gardner, and Sameer Singh. 2022. Impact of pretraining term frequencies on few-shot numerical reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 840–854, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Jacob Russin, Sam Whitman McGrath, Danielle J. Williams, and Lotem Elber-Dorozko. 2024. From Frege to chatGPT: Compositionality in language, cognition, and deep neural networks. ArXiv, abs/2405.15164.
Timo Schick and Hinrich Schütze. 2019. Learning semantic representations for novel words: Leveraging both form and context. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6965–6973.
Timo Schick and Hinrich Schütze. 2020. Rare words: A major problem for contextualized embeddings and how to fix it by attentive mimicking. In AAAI Conference on Artificial Intelligence.
Jingyuan Sun, Shaonan Wang, and Chengqing Zong. 2018. Memory, show the way: Memory based few shot word representation learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1435–1444, Brussels, Belgium. Association for Computational Linguistics.
Ryan Teehan, Brenden Lake, and Mengye Ren. 2024. CoLLEGe: Concept embedding generation for large language models. In First Conference on Language Modeling.
Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony S. Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel M. Kloumann, A. V. Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, R. Subramanian, Xia Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zhengxu Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melissa Hall Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Alex Warstadt, Aaron Mueller, Leshem Choshen, Ethan Wilcox, Chengxu Zhuang, Juan Ciro, Rafael Mosquera, Bhargavi Paranjabe, Adina Williams, Tal Linzen, and Ryan Cotterell. 2023. Findings of the BabyLM challenge: Sample-efficient pretraining on developmentally plausible corpora. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, pages 1–34, Singapore. Association for Computational Linguistics.
Jason Wei, Dan Garrette, Tal Linzen, and Ellie Pavlick. 2021. Frequency effects on syntactic rule learning in transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 932–948, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. In North American Chapter of the Association for Computational Linguistics.
Aditya Yedetore, Tal Linzen, Robert Frank, and R. Thomas McCoy. 2023. How poor is the stimulus? evaluating hierarchical generalization in neural networks trained on child-directed speech. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9370–9393, Toronto, Canada. Association for Computational Linguistics.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with bert. ICLR.
George Kingsley Zipf. 1949. Human behavior and the principle of least effort. Addison-Wesley Press.

A Word Usage Dataset Creation

As we mentioned in Section 3.2, we construct one dataset from each of two corpora: CHILDES (MacWhinney, 1992) and BabyLM-10M (Warstadt et al., 2023). The CHILDES dataset is licensed for use under a CC BY-NC-SA 3.0 license.⁸ Our scientific use is under the terms of the license.⁹ We did not find the license of the BabyLM dataset, which aggregated multiple public datasets. Since there is plenty of published work using this public dataset, we believe our scientific use does not violate any terms or conditions. In the following, we describe how we preprocess these two corpora and create a word usage dataset from each corpus.

Preprocessing. Since the basic units of our focus are words (as opposed to word pieces in other tokenization schemes), we need to identify words in the text. To achieve this, we apply the same word-level tokenization to all datasets (for consistency) and mark word boundaries by whitespace during preprocessing. Models trained from scratch use this word-level tokenization.
When the text is used in finetuning Llama-3, which comes with its pre-trained subword tokenizer, we remove the unnatural spaces introduced by the word-level tokenization and tokenize the text again with the Llama-3 tokenizer, so the text format becomes closer to its pre-training data (see the Finetuning paragraph in Appendix B for further details of this process).

For CHILDES data, we preprocess the data in the same way as Yedetore et al. (2023), who use children’s input in the North American English portion, but we do not split and unk the data at the preprocessing stage. For BabyLM data, we use the data in the 10M track of the BabyLM Challenge 2023, which mixes 10 portions, each from a different data source (child- or adult-oriented, speech transcription or written text like Wikipedia). We exclude the QED portion for its poor quality (also mentioned in the 2nd BabyLM Challenge). We apply word-level tokenization on untokenized portions, and then split the text into sentences using heuristics. We use spaCy for all word-level tokenization along with Part-Of-Speech tagging. We lowercase all text before preprocessing to unify the capitalization of words in different places. We deduplicate sentences and remove sentences having fewer than 1 word (not counting punctuation).

⁸ https://talkbank.org/share/rules.html
⁹ https://creativecommons.org/licenses/by-nc-sa/3.0/

Assigning sentences and splitting. To create a dataset from a corpus, we first get the token frequencies of all words. (Here, a word means a word-form. We discuss its implications in Appendix H.) Then we select the set of words to be meta-learned. We only consider nouns, verbs, adjectives, and adverbs to be meta-learned (a word’s syntactic category is based on the word’s most frequent Part-Of-Speech tag). We choose two thresholds for meta-learned words: the maximum frequency of a meta-learned word and the minimum number of examples per meta-learned word.
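
The candidate-selection step just described (frequency counts, most frequent Part-Of-Speech tag, maximum-frequency threshold) can be sketched as follows. This is an illustrative sketch rather than the released code: the input is assumed to be already tokenized and tagged, the tag names follow the coarse universal POS labels that spaCy emits, and whether proper nouns count as nouns is not stated in the text, so they are excluded here. The function name `select_candidates` is hypothetical.

```python
from collections import Counter, defaultdict

# Coarse universal POS tags for the four open classes (assumed tag names).
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def select_candidates(tagged_sentences, max_freq):
    """tagged_sentences: list of sentences, each a list of (word, pos) pairs.
    Returns word-forms whose most frequent tag is an open-class tag and whose
    corpus frequency does not exceed max_freq."""
    freq = Counter()
    pos_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, pos in sentence:
            freq[word] += 1
            pos_counts[word][pos] += 1
    candidates = set()
    for word, count in freq.items():
        top_pos = pos_counts[word].most_common(1)[0][0]
        if top_pos in CONTENT_POS and count <= max_freq:
            candidates.add(word)
    return candidates
```

The second threshold, the minimum number of examples per word, is not applied here because it depends on how sentences are later assigned to words.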
We use a greedy algorithm to assign each sentence in the corpus to the example set of at most one potential meta-learned word that occurs in the sentence, so each meta-learned word has at least the minimum number of examples. This ensures that the model cannot infer the identity of the word masked by the placeholder token from other sentences. These words and their example sets constitute the meta-learning component of the dataset. We include the remaining sentences not assigned to any meta-learned word in the language-modeling component. Finally, we split both the meta-learning component (by word) and the language-modeling component (by sentence) into training (80%), validation (10%), and test (10%) portions.

When training models from scratch, we build the vocabulary from the words occurring with a minimum frequency in the training portion (same as the minimum number of examples per meta-learned word) while excluding all meta-learned words. This ensures that meta-learned words, like the lowest-frequency words, are out-of-vocabulary and will be replaced by <unk> tokens, so they will never be learned in-weights. Statistics of our created datasets are shown in Table 5. Read our code for full details.

| Setting | CHILDES | BabyLM-10M |
|---|---|---|
| max. freq. of meta-learned words | 200 | 15 |
| min. #uses of meta-learned words | 5 | 5 |
| vocabulary size | 2179 | 22,696 |

| Component | Statistic | CHILDES training | CHILDES valid. | CHILDES test | BabyLM-10M training | BabyLM-10M valid. | BabyLM-10M test |
|---|---|---|---|---|---|---|---|
| meta-learning | #meta-learned words | 7790 | 973 | 975 | 15,821 | 1977 | 1979 |
| meta-learning | total #uses | 201,957 | 26,449 | 26,234 | 108,466 | 13,552 | 13,563 |
| meta-learning | mean #uses | 25.93 | 27.18 | 26.91 | 6.86 | 6.85 | 6.85 |
| meta-learning | total #tokens | 1,899,159 | 245,509 | 243,387 | 2,072,560 | 260,701 | 257,933 |
| meta-learning | mean sentence length | 9.40 | 9.28 | 9.28 | 19.11 | 19.24 | 19.02 |
| meta-learning | unk rate | 3.32% | 3.28% | 3.28% | 3.61% | 3.78% | 3.91% |
| language modeling | #sentences | 508,630 | 63,578 | 63,580 | 521,911 | 65,238 | 65,240 |
| language modeling | total #tokens | 3,927,120 | 492,280 | 490,990 | 5,721,893 | 715,553 | 715,111 |
| language modeling | mean sentence length | 7.72 | 7.74 | 7.72 | 10.96 | 10.97 | 10.96 |
| language modeling | unk rate | 1.00% | 1.03% | 1.00% | 1.44% | 1.49% | 1.47% |
| both | total #tokens | 5,826,279 | 737,789 | 734,377 | 7,794,453 | 976,254 | 973,044 |

Table 5: Dataset statistics. All statistics are based on tokens, which mostly correspond to words except punctuation, due to our word-level tokenization. “unk rate” is the percentage of out-of-vocabulary tokens, which are replaced by <unk>, in all tokens. Unk rate is slightly higher in the validation and test portions than the training portion because we build the vocabulary from the training portion. As shown by the mean sentence lengths, the meta-learning sentences are longer on average than the language modeling sentences, since meta-learned words are of lower frequency and thus are usually in more complex sentences. We manually tune the two thresholds of meta-learned words so we have a sufficient number of meta-learned words while the unk rate is not too high.

B Model and Training Configurations

Training from scratch. We slightly modify the configuration of Pythia-160M (Biderman et al., 2023), which uses the Transformer architecture GPT-NeoX (Andonian et al., 2023). The configuration has 12 layers and a hidden dimension size of 768. We change the vocabulary size according to the corresponding dataset, as shown in Table 5. We also include three special tokens in the vocabulary: the placeholder token [new-token], the separator token <sep>, and <unk>, as mentioned in Section 4.
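
The vocabulary construction for models trained from scratch (minimum frequency in the training portion, meta-learned words excluded, three special tokens added) can be illustrated with the following sketch. The function names, the placement of special tokens at the lowest indices, and the alphabetical ordering are assumptions made for the example; whether the vocabulary sizes in Table 5 count the special tokens is not stated, so no size is asserted here.

```python
from collections import Counter

# Special tokens named in the text; their index order is an assumption.
SPECIAL_TOKENS = ["[new-token]", "<sep>", "<unk>"]


def build_vocab(train_sentences, meta_learned_words, min_freq):
    """Map tokens to ids, keeping words seen at least min_freq times in the
    training portion and dropping every meta-learned word."""
    freq = Counter(token for sentence in train_sentences for token in sentence)
    words = sorted(word for word, count in freq.items()
                   if count >= min_freq and word not in meta_learned_words)
    return {token: index for index, token in enumerate(SPECIAL_TOKENS + words)}


def encode(sentence, vocab):
    """Replace every out-of-vocabulary token with the <unk> id."""
    unk_id = vocab["<unk>"]
    return [vocab.get(token, unk_id) for token in sentence]
```

Because meta-learned words never enter the vocabulary, any occurrence that is not replaced by the placeholder is encoded as `<unk>`, which is how the setup keeps those words from being learned in the weights.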
We change the Pythia configuration to tie the input and output embeddings. This reduces the parameter counts to 86.7M and 102.5M for the models trained on CHILDES and BabyLM-10M, respectively. For both models, we use batch size (i.e., number of episodes/sequences per batch) 8 and the AdamW optimizer (Loshchilov and Hutter, 2019) with initial learning rate $3 \times 10^{-4}$, and multiply the learning rate by 0.1 when the validation loss has stopped improving for 2 epochs. We apply weight decay 0.07 and 0.15 when training on the CHILDES and BabyLM-10M datasets, respectively. Other configurations, such as no dropout, are kept the same as Pythia-160M. For each setting, we run 3 times with random seeds {0, 1, 2}. Each run is performed on a single V100 GPU for 30 epochs (9–18 hours).

Finetuning. We finetune Llama-3 8B (Llama Team, Meta AI, 2024) with Minnow on each of the CHILDES and BabyLM-10M datasets, but we refer to the models finetuned on BabyLM-10M by default, as we mentioned in Section 5. We finetune from both the pre-trained and instruction-tuned variants of Llama-3 8B, but we refer to the models finetuned from the pre-trained variant by default, presenting results of finetuning from the instruction-tuned variant only in the generative settings, where their performance may differ significantly due to their different capabilities to follow the prompt. We use two reserved special tokens in the Llama-3 tokenizer vocabulary as the placeholder token and the separator token. To make the tokenization more natural with respect to the model’s pre-training data, we clean up tokenization spaces in the text (e.g., the space before “,”, “.”, or “’s”) introduced by the word-level tokenization during preprocessing and make the placeholder token absorb any preceding spaces of the word. Finetuning is minimally parameter-efficient: we finetune only the input and output embeddings of the two special tokens, while freezing all other parameters.
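To make the freezing scheme concrete, here is a minimal sketch of an update step that touches only the embedding rows of the two special tokens. It is an illustration under stated assumptions rather than the authors' training code: it uses plain SGD in place of AdamW, represents the embedding matrix as a list of lists rather than a tensor, and the name `masked_sgd_step` is hypothetical.

```python
def masked_sgd_step(embeddings, grads, trainable_ids, lr):
    """Return updated embeddings where only rows in trainable_ids change.

    embeddings, grads: lists of equal-length float lists (one row per token).
    trainable_ids: ids of the tokens whose embeddings are finetuned
                   (here, the placeholder and separator tokens).
    """
    updated = [row[:] for row in embeddings]  # copy, so frozen rows stay identical
    for idx in trainable_ids:
        updated[idx] = [w - lr * g for w, g in zip(embeddings[idx], grads[idx])]
    return updated
```

In a tensor framework the same effect is usually obtained by masking the gradient of the embedding matrix so that all rows except the two special-token rows receive zero updates.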
Before finetuning, the input/output embedding of either token is initialized to the mean of all input/output embeddings (Hewitt, 2021). When finetuning the model on CHILDES with 5 examples per episode, we use batch size (i.e., number of episodes/sequences per batch) 32 and initial learning rate $3 \times 10^{-3}$, and truncate the sequence to a maximum length of 80 tokens to control the memory usage. When finetuning the model on CHILDES with 10 examples per episode, we use batch size 8 and initial learning rate $3 \times 10^{-4}$, and truncate the sequence to a maximum length of 180 tokens. When finetuning the model on BabyLM-10M with 5 examples per episode, we use batch size 16 and initial learning rate $1 \times 10^{-3}$, and truncate the sequence to a maximum length of 160 tokens. Other settings are the same as when training from scratch, except that we do not apply weight decay. Each run is performed on a single A100 GPU for 15 epochs on CHILDES (33 hours) or 12 epochs on BabyLM-10M (48 hours).

C Held-out Word Classification

As we mentioned in Section 3.3, we need different meta-learned words in the same group. Therefore, unlike in training, we sample only one episode of $K$ examples per word from the validation/test portions so that we do not repeat the same word in a classification group. We also fix the shuffle order so that all models are evaluated on the same classification task instances. We experimented with training models with $K \in \{5, 10\}$ examples per episode on CHILDES and BabyLM-10M and evaluated each of them on the corresponding dataset with the same $K$ and $C \in \{4, 8\}$. Training models with $K = 10$ examples per episode on BabyLM-10M was unsuccessful because the concatenated sequence was too long, exceeding the GPU memory, so we do not have results in this setting.

We are aware of the weaknesses of this task. Discriminating a new word from an arbitrary set of other new words is a relatively weak test of word meaning learning.
The task could be easy simply because different words are used in very different contexts, so the conditional likelihood may reflect just the coherence of the usage contexts between study and query examples, not the meaning of the new word (we demonstrate this point with an additional baseline below, in which we present the model with only the usage contexts, without new words). In addition, results from the task do not tell us what features of word meanings the model is learning. Our syntactic category classification task addresses these concerns by focusing on the syntactic aspect and breaking the semantic coherence between study and query examples (Section 5.2).

Below, we describe two baselines we run on this task.

Baseline: Llama-3 8B learning a pseudo-word in context (Llama-3 8B with ‘dax’). This is the baseline model introduced in Section 3.4. We follow the format described there and additionally prepend a prompt to improve performance: “The following lines are lowercased example sentences using a new word ‘dax’ in random order, one per line:”. (We discuss the consequence of using the same pseudo-word in Appendix H.)

Additional Baseline: Llama-3 8B modeling the coherence of usage contexts (Llama-3 8B with ‘’). This is the additional baseline to evaluate the effectiveness of utilizing just the coherence of the contexts, as we discussed above. We remove the new word from each example (equivalent to replacing the new word with an empty string), so only the usage context of each example is retained.

For these baselines, we also experimented with the instruction-tuned variant of Llama-3 8B, but it performs worse on this task.

Table 6 shows all models’ held-out word classification results on the test portions of the CHILDES and BabyLM-10M datasets.
| dataset | K | C | Minnow from scratch | Llama-3 8B with ‘’ | Llama-3 8B with ‘dax’ | Llama-3 8B + Minnow |
|---|---|---|---|---|---|---|
| CHILDES | 5 | 4 | 72.3 (1.6) | 58.33 | 71.09 | 79.1 (0.5) |
| CHILDES | 5 | 8 | 59.8 (0.4) | 46.49 | 60.02 | 70.4 (0.2) |
| CHILDES | 10 | 4 | 75.1 (0.7) | 66.56 | 76.53 | 84.9 (0.2) |
| CHILDES | 10 | 8 | 63.4 (1.5) | 56.17 | 66.05 | 75.9 (0.6) |
| BabyLM-10M | 5 | 4 | 77.4 (0.5) | 70.45 | 78.39 | 86.5 (0.6) |
| BabyLM-10M | 5 | 8 | 67.5 (0.7) | 60.12 | 69.74 | 80.5 (1.0) |

Table 6: Accuracy (%) of held-out word classification on the CHILDES and BabyLM-10M test sets. We show the mean and the standard deviation (in brackets) of 3 runs. “Minnow from scratch” means models trained from scratch on the corresponding dataset. “Llama-3 8B with ‘’” means the baseline model without the prompt and with the new word removed (i.e., the new word is replaced with an empty string). “Llama-3 8B with ‘dax’” means the baseline model with the prompt, learning the new word ‘dax’. We use $K - 1$ study examples in this classification task, and models except the baselines are trained/finetuned on $K$ examples per training episode, so they see the same number of examples during training and evaluation. $C$ is the number of words in each group, so we have $\lfloor n_{\text{episodes}} / C \rfloor$ groups. Note that we discard the last batch of fewer than $C$ episodes, so the number of episodes used is slightly smaller. Results of “Llama-3 8B with ‘’” show that the coherence of the context already provides better-than-chance accuracy on this classification task. Results of “Llama-3 8B with ‘dax’” show that the pre-trained LLM already performs well. However, “Llama-3 8B + Minnow” outperforms the baselines by a large margin, showing the effectiveness of our method. Models finetuned with Minnow from the instruction-tuned variant of Llama-3 8B perform worse than or close to the pre-trained variant here (the instruction-tuned variant finetuned with Minnow has 86.3% (4-way) and 80.1% (8-way) mean classification accuracies; the instruction-tuned variant with ‘dax’ has 75.2% (4-way) and 66.0% (8-way) classification accuracies), so we do not include their results here.
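The grouping and scoring in this evaluation can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes that, within a group of C episodes, each query example is scored against every episode's study examples by conditional log-likelihood and is counted correct when its own study set scores highest. The function names and the `scores` matrix layout are hypothetical.

```python
def make_groups(episodes, c):
    """Split episodes into floor(n / c) groups of size c, dropping the incomplete remainder."""
    n_groups = len(episodes) // c
    return [episodes[i * c:(i + 1) * c] for i in range(n_groups)]


def group_accuracy(scores):
    """C-way accuracy within one group.

    scores[i][j] is the model's log-likelihood of query example i
    conditioned on the study examples of episode j (assumed layout).
    Query i is correct when its own study set (j == i) scores highest.
    """
    correct = 0
    for i, row in enumerate(scores):
        predicted = max(range(len(row)), key=lambda j: row[j])
        correct += int(predicted == i)
    return correct / len(scores)
```

Under this setup the chance level is 1/C, i.e., 25% for C = 4 and 12.5% for C = 8, which is the reference point for the accuracies in Table 6.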
D Syntactic Category Classification

As we mentioned in Section 5.2, we use the methodology of Kim and Smolensky (2021) and the dataset they constructed. The dataset was constructed from MNLI, a Natural Language Inference dataset (Williams et al., 2017). The task is to discriminate between a pair of words in two different syntactic categories. They consider 4 syntactic categories: noun, verb, adjective, and adverb. Therefore, they have 6 pairs of categories for discrimination. For each category pair, the dataset contains two signal contexts (one for each category; we use them as the study examples) and 200 test sentences using a word unambiguously in either category (100 for each category; we use them as the query examples). The main difference between our approach and that of Kim and Smolensky (2021) is that, instead of finetuning a new word embedding on each signal context, we apply in-context learning, using each signal context as an in-context study example of the new word. See Kim and Smolensky (2021) for further details.

Results from models trained from scratch, the Llama-3 8B baseline, and models finetuned from Llama-3 8B on the 6 category pairs and their mean are visualized in Figure 2. Table 7 shows detailed results from the Llama-3 8B baseline and Llama-3 8B finetuned with Minnow on BabyLM-10M. Table 8 shows detailed results from models trained from scratch on both datasets.

[Figure 2: grouped bar chart. x-axis: Category Pair (Mean, N vs. V, N vs. Adj, N vs. Adv, V vs. Adj, V vs. Adv, Adj vs. Adv); y-axis: Accuracy (0 to 100); legend: Minnow from scratch on CHILDES, Minnow from scratch on BabyLM-10M, Llama-3 8B baseline, Llama-3 8B + Minnow on BabyLM-10M.]

Figure 2: Syntactic classification accuracy. Error bar shows the 95% confidence interval given 3 runs. “Minnow from scratch on CHILDES” (blue) and “Minnow from scratch on BabyLM-10M” (orange) mean the models trained from scratch with Minnow on CHILDES and BabyLM-10M, respectively.
(These models have a closed vocabulary, so many words in the dataset will be out-of-vocabulary and be presented as <unk>, which could make the task easier.) “Llama-3 8B baseline” (green) means the Llama-3 8B baseline with pseudo-word “dax”. “Llama-3 8B + Minnow on BabyLM-10M” (red) means Llama-3 8B finetuned with Minnow on BabyLM-10M. “N”, “V”, “Adj”, and “Adv” are short for noun, verb, adjective, and adverb, respectively. “Mean” is the mean across all category pairs. The black dashed line marks the chance level (50%). “Llama-3 8B + Minnow on BabyLM-10M” (red) shows improvement over “Llama-3 8B baseline” (green) in all category pairs, with mean accuracy rising from 64% to 83%. Note that “Minnow from scratch on BabyLM-10M” (orange) has a 77% mean accuracy, much better than the baseline accuracy and even comparable to the Minnow models finetuned from Llama-3 8B on many category pairs, again demonstrating its data efficiency.

| Cat. 1 | Cat. 2 | Llama-3 8B baseline: Acc. | Llama-3 8B baseline: Acc. (1 > 2) | Llama-3 8B baseline: Acc. (2 > 1) | Llama-3 8B + Minnow: Acc. | Llama-3 8B + Minnow: Acc. (1 > 2) | Llama-3 8B + Minnow: Acc. (2 > 1) |
|---|---|---|---|---|---|---|---|
| Noun | Verb | 71.0 | 43 | 99 | 86.3 (1.5) | 74.7 (1.7) | 98.0 (1.6) |
| Noun | Adjective | 66.0 | 79 | 53 | 84.0 (2.2) | 71.3 (4.6) | 96.7 (0.5) |
| Noun | Adverb | 64.0 | 55 | 73 | 81.3 (2.2) | 75.7 (1.7) | 87.0 (2.9) |
| Verb | Adjective | 70.5 | 49 | 92 | 92.7 (0.5) | 90.0 (2.2) | 95.3 (1.2) |
| Verb | Adverb | 53.0 | 85 | 21 | 78.8 (5.2) | 90.0 (2.4) | 67.7 (12.5) |
| Adjective | Adverb | 61.5 | 42 | 81 | 72.8 (0.2) | 57.3 (2.6) | 88.3 (3.1) |

Table 7: LLMs’ accuracies (%) of distinguishing two syntactic categories in novel contexts. We show the mean and the standard deviation (in brackets) of 3 runs. ‘Acc. (1 > 2)’ denotes the accuracy on the set of sentences where Category 1 should be preferred over Category 2 (e.g., assigning a higher probability to a noun in a noun-expecting context for row 1), and vice versa. Column ‘Acc.’ lists the aggregate accuracy. “Llama-3 8B + Minnow” has accuracies significantly better than chance except distinguishing adjective from verb (row 5).
Additionally, “Llama-3 8B + Minnow” improves over the Llama-3 8B baseline in differentiating most category pairs except discriminating nouns from adjectives (row 2), showing the effectiveness of finetuning with Minnow.

| Cat. 1 | Cat. 2 | CHILDES: Acc. | CHILDES: Acc. (1 > 2) | CHILDES: Acc. (2 > 1) | BabyLM-10M: Acc. | BabyLM-10M: Acc. (1 > 2) | BabyLM-10M: Acc. (2 > 1) |
|---|---|---|---|---|---|---|---|
| Noun | Verb | 84.5 (2.3) | 79.7 (3.7) | 89.3 (4.5) | 93.5 (1.8) | 90.0 (2.2) | 97.0 (1.4) |
| Noun | Adjective | 73.5 (0.4) | 50.7 (2.9) | 96.3 (2.1) | 86.2 (2.5) | 79.7 (5.4) | 92.7 (1.9) |
| Noun | Adverb | 62.2 (1.8) | 90.3 (4.1) | 34.0 (6.4) | 67.8 (3.7) | 86.3 (3.1) | 49.3 (5.8) |
| Verb | Adjective | 92.3 (1.4) | 90.0 (2.8) | 94.7 (1.2) | 95.7 (1.2) | 93.0 (2.4) | 98.3 (0.5) |
| Verb | Adverb | 38.5 (6.5) | 57.3 (14.7) | 19.7 (1.7) | 56.7 (5.3) | 68.7 (5.8) | 44.7 (11.4) |
| Adjective | Adverb | 53.8 (5.3) | 44.0 (5.4) | 63.7 (10.1) | 62.3 (1.9) | 59.0 (6.5) | 65.7 (4.1) |

Table 8: Accuracies (%) of distinguishing two syntactic categories in novel contexts for models trained from scratch with Minnow (on CHILDES and on BabyLM-10M). We show the mean and the standard deviation (in brackets) of 3 runs. ‘Acc. (1 > 2)’ denotes the accuracy on the set of sentences where Category 1 should be preferred over Category 2 (e.g., assigning higher probability to a noun in a noun-expecting context for row 1), and vice versa. Column ‘Acc.’ lists the aggregate accuracy. Both models perform better than chance on many category pairs, suggesting that models can develop some ability to one-shot learn the syntactic category of a word from human-scale data with Minnow.

E Comparing Generations

For new usage example generation (Section 5.3), we show GPT-4o the following text format:

    The following lines are shuffled lowercased example sentences using a new word ‘dax’, one per line:
    * EXAMPLE-1
    * EXAMPLE-2
    * EXAMPLE-3
    * EXAMPLE-4
    Please answer in a single uppercase letter: Which of the following is a better next example for the word ‘dax’, or they tie?
    A) OPTION-A
    B) OPTION-B
    C) OPTION-C

where OPTION-A, OPTION-B, OPTION-C are shuffled generation-1, generation-2, and “Tie”.
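As a rough illustration of how such a judging prompt could be assembled and how the reply could be mapped back to an option, consider the sketch below. It is not the authors' code: the function names are hypothetical, the exact line breaks and quote characters of the real prompt are assumptions (the extracted text does not preserve them), and the call to the judge model itself is left out.

```python
import random


def build_judge_prompt(examples, generation_1, generation_2, rng, word="dax"):
    """Build the comparison prompt; returns (prompt, letter -> option mapping)."""
    options = [generation_1, generation_2, "Tie"]
    rng.shuffle(options)  # shuffle so that option position carries no signal
    letters = ["A", "B", "C"]
    lines = [
        "The following lines are shuffled lowercased example sentences "
        f"using a new word '{word}', one per line:"
    ]
    lines += [f"* {example}" for example in examples]
    lines.append(
        "Please answer in a single uppercase letter: Which of the following "
        f"is a better next example for the word '{word}', or they tie?"
    )
    lines += [f"{letter}) {option}" for letter, option in zip(letters, options)]
    return "\n".join(lines), dict(zip(letters, options))


def parse_choice(response, letter_to_option):
    """Take the first character of the reply as the choice; None if unrecognized."""
    first = response.strip()[:1]
    return letter_to_option.get(first)
```

Returning the letter-to-option mapping alongside the prompt makes it possible to undo the shuffle after the judge replies, so wins can be attributed to the right model.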
For definition generation (Section 5.4), we do not have the examples (and the prompt before them) and instead have the direct prompt before the options: “Please answer in a single uppercase letter: Which of the following is a better definition for the word ‘Word’, or they tie?” where Word is the ground-truth word form. We always take the first letter (A, B, or C) of the GPT-4o response as the choice.

Tables 1 and 9 show the results of comparing the Llama-3 8B baseline (pre-trained or instruction-tuned) to the Minnow model finetuned from that baseline (with random seed 0) on new examples and definitions generated by greedy decoding and top-p = 0.92, respectively.

| Variant | Method | New Example: BabyLM-10M test | New Example: Chimera | Definition: CoLLEGe-DefGen |
|---|---|---|---|---|
| Pre-trained | baseline | 39 | 39 | 25 |
| Pre-trained | +Minnow | 53 | 42 | 31 |
| Instruction-tuned | baseline | 46 | 52 | 33 |
| Instruction-tuned | +Minnow | 47 | 36 | 28 |

Table 9: Percentages of wins of each model when comparing the generations from the Llama-3 8B baseline (pre-trained or instruction-tuned) with the Minnow model finetuned from that baseline, judged by GPT-4o. The left two datasets are for new usage example generation in Section 5.3, and the right-most one is for definition generation in Section 5.4. Each new example or definition is generated by top-p = 0.92. The percentage of ties is the remainder after subtracting the win percentages of the two models. GPT-4o more frequently chooses the Minnow model as the winner compared to the corresponding baseline, except for the instruction-tuned model on Chimera and CoLLEGe-DefGen.

F Evaluation of Generated New Usage Examples

As we mentioned in Section 5.3, we sample study examples from two datasets: the BabyLM-10M test portion and the Chimera dataset (Lazaridou et al., 2017). Statistics of the BabyLM-10M test portion are in Table 5, Appendix A. The Chimera dataset contains 33 chimeras.
For each chimera, we merge all examples in its 10 “passages” (each with a different pseudo-word in place of the word) in the original dataset, so we only have 33 different words to learn. We do not use the pseudo-words in the original dataset.

Tables 10 and 11 show samples of example sentences from the BabyLM-10M test portion and the Chimera dataset along with the new usage examples generated by both the Llama-3 8B baseline and the Minnow model finetuned from that baseline. Table 12 contains error analysis of new examples generated from the BabyLM-10M test portion, extending Table 2 in Section 5.3.

Word: grimm
Study example sentences:
• tales similar to " bluebeard " include " the fitcher’s bird " by the brothers [new-token] and the english tale " mr. fox ".
• these types of tales include royalty and fantastical magic, many of which are published by brothers [new-token] and hans christian anderson.
• previously on [new-token] ...
• the house in the wood from the german of [new-token] .
Minnow: previously in [new-token]
Baseline: the story is actually entitled " the bridal of trier " as mentioned on brothers [new-token] site.
Judgment and Reason: Baseline wins. Minnow finetuned output simply changes the preposition of an example, which is likely incorrect. The baseline output copies “brothers [new-token]”, which is likely correct.

Word: cellars
Study example sentences:
• men were sent to search the [new-token] , and there, on the very night before the deed was to be done, guy fawkes was found waiting till the time should come to set fire to the powder.
• the actual [new-token] they’ve got some of the old er writings that the, the monks and you’re not allowed to take photographs.
• searched all the tunnels and [new-token] .
• he puts them in [new-token] , anderson shelters, confined spaces, things like that.
Minnow: because they can light it in a [new-token] and blow it with the minimum amount of people.
Baseline: when a young teacher of history at the university of the commonwealth in london, [new-token] began the revolutionary research into indian philosophy of patanjali that was later to prove so fruitful.
Judgment and Reason: Minnow model wins. The word likely refers to a place or space. Only the Minnow finetuned output uses the word as a place.

Word: amulet
Study example sentences:
• the sorcerer wotan is using the [new-token] of aten to blot out the sun.
• unless it involves some kind of enchanted [new-token] ,
• i’ll give her the lucky [new-token] you gave m
• this can be a long ritual, or it can take the form of simply wearing an [new-token] or other token, " against bad luck ".
Minnow: first, she goes with colm and feels a little guilty for keeping his [new-token] .
Baseline: the narrative treatment of [new-token] in the media exemplifies this process.
Judgment and Reason: Minnow model wins. The word refers to a ritual token. The Minnow finetuned output suggests it is an object, which could agree with the inference, while the baseline output does not convey a lot of information.

Word: vestigial
Study example sentences:
• they are blind, their eyes having become reduced to [new-token] lenses under the skin, and they have no external ears, just a pair of tiny holes hidden under thick hair.
• it is a [new-token] reflex, a remnant of the past when pre - human babies clung to their mothers ’ hair as the mothers swung through the trees.
• however, [new-token] structures may have their original function replaced with another.
• strong evidence for common descent comes from [new-token] structures.p397 the useless wings of flightless beetles are sealed under fused wing covers.
Minnow: the [new-token] structures of the human body are the appendix, the coccyx, the earlobes, the tailbone, and the wisdom teeth.
Baseline: the [new-token] structures of the human body are the same as those of other mammals.
Judgment and Reason: Tie. The word should be an adjective meaning an inherited biological feature that lost its original function. Both generations fit this interpretation. GPT-4o decides the Minnow model as the winner.

Word: bubbled
Study example sentences:
• upon the very margin of the ditch, not thirty feet from where they crouched, an iron caldron [new-token] and steamed above a glowing fire ;
• all round the hull, in the blackness, the rippling current [new-token] and chattered like a little mountain stream.
• there hung another kettle that [new-token] and boiled ; but there was no fire beneath it, either.
• and when he lifted it up and looked down, there stood a great copper cauldron which [new-token] up and boiled away down there ;
Minnow: the boy looked down into the pot and saw that it [new-token] and boiled away.
Baseline: he turned his back upon them, and went in at the door in the [new-token] , which was a pleasant door to look upon.
Judgment and Reason: Minnow model wins. The word should be a past-tense verb meaning a physical movement of an object, likely caused by steam or current. The Minnow finetuned output is consistent with this meaning, but the baseline incorrectly uses the word as a noun.

Table 10: New examples generated on study examples from the BabyLM-10M test portion by the Minnow model and the Llama-3 8B baseline it is finetuned from, with greedy decoding. In addition to GPT-4o’s judgment, the first author manually compared each pair of generations and gave the judgment and reason (Judgment and Reason). When the author and GPT-4o compare the generations, they cannot see the ground-truth word but have to infer the possible meaning of the word. When GPT-4o’s judgment disagrees with the author’s, it is mentioned at the end of Judgment and Reason.

Word: alligator + rattlesnake
Study example sentences:
• Animals such as capybara, jaguars, jacare ([new-token] s) and hyacinth macaws are particularly vulnerable.
• Nadirpur stared at it as though it were a [new-token] , his face quite drained.
• The owner will be summonsed to court for illegal possession of a [new-token] without a permit.
• I went to the cavern to get some, and found a [new-token] in there.
Minnow: The [new-token] was a little more than a metre long, and had a thick, black coat.
Baseline: The [new-token] was a large, black, furry creature with a long, pointed snout.
Judgment and Reason: Minnow model wins. An alligator (inferred from the first example) is not furry.

Word: cello + bagpipe
Study example sentences:
• He is principal [new-token] of the English Baroque soloists, with whom he has recorded for Erato and Deutsche Grammophon.
• Banners point to the military aspect of [new-token] music: a Jacobite flag and the first Union flag.
• " Alexander Baillie said that he was impressed by her playing and even tried out her [new-token] !
• Where did the [new-token] idea at the end of Rubina’s Blue Sky Happiness come from?
Minnow: The first movement is a [new-token] , the second a minuet, and the third a rondo.
Baseline: The [new-token] is a small, furry, four-legged creature with a long tail.
Judgment and Reason: Tie. Neither generation is for an instrument.

Word: ladle + colander
Study example sentences:
• 4. Allow the stock to settle then skim off the fat using [new-token] or fat separator.
• Add the pasta, cook it for about 12 minutes until just soft, then drain it in a [new-token] .
• Imagine you’ve got a bucketful of praise in every room of the house with a big [new-token] .
• Plants can be grown in tyres, old metal pots such as buckets, watering cans or even a [new-token] !
Minnow: 2. Put the chicken in a [new-token] and cover with cold water.
Baseline: The [new-token] is a very useful tool for the gardener.
Judgment and Reason: Minnow model wins. The word refers to a kitchen utensil, likely a container. The baseline may be misled by the last example but does not notice the word “even”, which indicates that the usage is surprising.

Word: trouser + shirt
Study example sentences:
• At about the same age I was in a cafe and I spilt some pop on my [new-token] s.
• It’s also the first International home [new-token] in history that is reversible as well.
• Such as how if you have sufficiently-stretchy pants you could take them off without first removing your [new-token] s.
• It was in a cinema on 86th Street where he sat in the balcony, wearing an old sports jacket and open-necked [new-token] .
Minnow: The [new-token] is a bit tight.
Baseline: I was in a cafe and I spilt some pop on my [new-token] s.
Judgment and Reason: Tie. A word for clothes can be used in both examples. GPT-4o decides the baseline as the winner.

Table 11: New examples generated on study examples from the Chimera dataset (Lazaridou et al., 2017) by the Minnow model and the Llama-3 8B baseline it is finetuned from, with greedy decoding. As in Table 10, the first author provides the judgment and reason (Judgment and Reason). The first example is a reasonable guess of the appearance of an animal. The second example is related to music, but incorrectly infers the word to be a dance or musical form. The last example correctly infers that the word is a kitchen utensil.

Word: hittites
Study example sentences:
• the first blacksmiths were [new-token] .
• many civilisations were in the area that is now turkey, like the [new-token] , the roman empire and the byzantine empire.
• spread of hepatoscopy and astrology to [new-token] , etruscans, greeks and romans and to china
• the first major empire in the area was the [new-token] (from the 18th century to the 13th century bce).
Minnow:
1. the [new-token] were a people who lived in the area of turkey.
2. perhaps the most famous and widely used alchemical symbol, first popularized by [new-token] alchemists, is the ouroboros.
Baseline:
1. the [new-token] was a major power in the region, and its influence spread to the surrounding areas.
2. epistemology will suffer the [new-token] syndrome.
Analysis: The Minnow model learns that hittites is an ancient ethnic group. However, the greedy-decoded example copies the information (turkey) from the study example, while the sampled example makes seemingly plausible but factually incorrect generalizations (the earliest known ouroboros is found in an ancient Egyptian text). The baseline also generates factual errors (the Hittite plague exists, but there is no “hittites syndrome”).

Word: restores
Study example sentences:
• but, i know from his childhood, he says that his father always buys and [new-token] cars just as entertainment, i guess.
• in the end, yuji accepts shana’s feelings and the two kiss, which activates a spell left to yuji by crimson lord lamia that [new-token] his existence, so yuji is no longer a torch.
• – a medicine which [new-token] health and energy, as extract of malt, cod liver oil, etc.
• that diversifies energy supply, [new-token] ecosystems, protects biodiversity, and sequesters carbon.
Minnow:
1. the [new-token] of the [new-token] of the [new-token] of [Infinite Repetition]
2. he is a former prime minister, who is then [new-token] of his duties.
Baseline:
1. the [new-token] is a 3.5 mm jack, which is the standard for most headphones.
2. he [new-token] up to an incredible 2,620 sales per store in 2009, compared with the 60 to 80 that had been typical just a few years earlier.
Analysis: For the Minnow model, the greedy-decoded example is nonsense repetition (a common failure of LMs), while the sampled example uses the verb with incorrect syntax (“be restored to”) but pertinent semantics. For the baseline, the greedy-decoded example also has incorrect syntax (it uses the word as a noun), and the use in the sampled example is semantically incorrect (the sales must have reached a similar level in the past).

Word: backpacking
Study example sentences:
• when i was a student [new-token] , i went to the prado in Madrid.
• [new-token] is hiking in the outdoors with a backpack full of all your food, clothing, water, and shelter.
• [new-token] may be done on short or long trips.
• [new-token] (wilderness)
Minnow:
1. [new-token] is a form of wilderness travel.
2. a small [new-token] campsite is usually composed of a place to sleep, food, tools and water.
Baseline:
1. [new-token] is a type of backpacking.
2. [new-token] is something all people can do.
Analysis: The Minnow model seems to infer the meaning of the word, but is rewording the study examples. The baseline also seems to infer the meaning, but its sampled example is not very informative.

Table 12: Error analysis of new examples generated from the BabyLM-10M test portion by the Minnow model and the baseline. In each column of generated examples, the first one is generated by greedy decoding, and the second one by sampling with top-p = 0.92.

G Evaluation of Generated Definitions

As we mentioned in Section 5.4, we use two definition generation datasets: CoLLEGe-DefGen (Teehan et al., 2024) and the Oxford test set (Gadetsky et al., 2018). The original datasets contain 954 and 12,232 words, from which we removed 4 and 2 duplicated words, respectively. For CoLLEGe-DefGen, we keep the inflectional suffixes, such as “-s”, “-ed”, and “-ly”, after the placeholder so that the placeholder only corresponds to the word stem. This is to remove the influence of morphological inflections. Note that we use our placeholders instead of the <nonce> in the original text of CoLLEGe-DefGen. In addition, we fixed several incorrect word/phrase replacements in the original dataset (for example, the phrase “capital gains tax”). For the Oxford dataset, for simplicity and consistency with previous work, we do not keep the inflectional suffixes but rather replace the whole word with the placeholder. In 12% of the examples in the Oxford test set, we find no occurrence of any form of the word to be learned, but we keep these examples for consistency with previous work.
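The two replacement policies described above (keeping the inflectional suffix after the placeholder versus replacing the whole word) can be illustrated with a small sketch. This is an assumption-laden toy, not the authors' preprocessing: the function name `mask_word` is hypothetical, the suffix list contains only the three suffixes named in the text, and real inflection (e.g., stem changes such as “study”/“studied”) would need more careful handling.

```python
import re

PLACEHOLDER = "[new-token]"
# Only the suffixes named in the text; a real pipeline would need a fuller list.
SUFFIXES = ("s", "ed", "ly")


def mask_word(sentence, stem, keep_suffix):
    """Replace occurrences of `stem` (optionally followed by a listed suffix).

    keep_suffix=True  : placeholder stands for the stem; the suffix is kept.
    keep_suffix=False : the whole word form is replaced by the placeholder.
    """
    suffix_group = "|".join(SUFFIXES)
    pattern = re.compile(
        rf"\b{re.escape(stem)}({suffix_group})?\b", flags=re.IGNORECASE
    )

    def replace(match):
        suffix = match.group(1) or ""
        return PLACEHOLDER + (suffix if keep_suffix else "")

    return pattern.sub(replace, sentence)
```

For instance, with the stem “walk”, the first policy would turn “walked” into the placeholder followed by “ed”, while the second would replace the entire word form.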
Additionally, as we also mentioned in Section 5.4, we have additional references for what can be achieved by specialized definition-generation models: the series of FLAN-T5 (Chung et al., 2024) models finetuned by Giulianelli et al. (2023) specifically on generating definitions. This also follows what Teehan et al. (2024) did. These models were finetuned on three corpora, including the Oxford training set (Gadetsky et al., 2018). The series of finetuned FLAN-T5 models are listed on their GitHub page (https://github.com/ltgoslo/definition_modeling?tab=readme-ov-file#definition-generation-models-for-english) and can be accessed through the Hugging Face model hub. When evaluating the FLAN-T5 models, a pseudo-word ‘wug’ is used as the placeholder for the new word, as in other baselines (Section 3.4), for a fair comparison. Each FLAN-T5 model is prompted with an example sentence followed by a question, “What is the definition of wug?”, as Giulianelli et al. (2023) did.

Table 13 shows the results of comparing the model-generated and ground-truth definitions from all models, supplementing the brief results in Table 3 with those from the additional specialized FLAN-T5 baselines and the CoLLEGe model. Table 14 shows the average of 1-, 2-, and 3-shot results on the CoLLEGe-DefGen dataset. Tables 15 and 16 show additional definitions generated from the CoLLEGe-DefGen and Oxford test sets by the baselines and the Minnow models (in addition to Table 4 in Section 5.4).

Results of CoLLEGe (Teehan et al., 2024), which generates new embeddings to be used in an LLM, are appended to Tables 13 and 14 and are directly copied from the original paper.
Those numbers should be compared with other models with caution because they have different settings: they are based on Llama-2 7B (Touvron et al., 2023) and new embeddings; their data processing is not as fine as ours (they did not remove duplicated words from either dataset, did not keep the inflectional suffixes in CoLLEGe-DefGen, and did not find as many word forms in the Oxford dataset as we do, leaving 15% of examples without any word form replaced); and the usage examples they randomly selected from CoLLEGe-DefGen are different from ours.

| Model variant | Method | CoLLEGe-DefGen: BERTScore F1 | CoLLEGe-DefGen: ROUGE-L | Oxford: BERTScore F1 | Oxford: ROUGE-L |
|---|---|---|---|---|---|
| FLAN-T5 Base +DefInstr | baseline | 83.1 | 13.1 | 84.4 | 16.5 |
| FLAN-T5 Large +DefInstr | baseline | 83.8 | 15.5 | 84.7 | 17.4 |
| FLAN-T5 XL +DefInstr | baseline | 83.1 | 12.4 | 84.9 | 19.4 |
| Llama-3 8B | baseline | 85.1 | 14.9 | 83.2 | 11.0 |
| Llama-3 8B | +Minnow | 85.4 | 18.7 | 84.7 | 16.3 |
| Llama-3 8B Instruct | baseline | 85.3 | 17.6 | 83.6 | 12.5 |
| Llama-3 8B Instruct | +Minnow | 85.8 | 20.7 | 84.7 | 16.5 |
| Llama-2 7B | CoLLEGe* | 84.1 | 18.0 | 83.6 | 17.1 |

Table 13: Quantitative evaluation of generated definitions by comparing them with ground-truth definitions. This table extends Table 3 by presenting results from all models we evaluate, including the additional specialized FLAN-T5 baselines from Giulianelli et al. (2023) and the CoLLEGe model from Teehan et al. (2024). We generate a definition from only one example (1-shot). We sample an example per word from CoLLEGe-DefGen, while Oxford has exactly one example per word. All definitions are generated with greedy decoding. “+DefInstr” means the definition generation finetuning by Giulianelli et al. (2023). “baseline” means using a pseudo-word ‘wug’ as the placeholder word. For Minnow models (“+Minnow”), scores are averaged across 3 runs. The instruction-tuned variant of Llama-3 8B (“Llama-3 8B Instruct”) is better than the pre-trained variant (“Llama-3 8B”) on definition generation, likely due to its better instruction-following ability.
*: CoLLEGe results are from “Prompting + CoLLEGe” in Table 4 of Teehan et al. (2024), which provides Llama-2 7B (Touvron et al., 2023) with embeddings generated by CoLLEGe and prompts it to generate definitions with in-context usage examples. Teehan et al. (2024) use slightly different data processing, so CoLLEGe results are not strictly comparable (see Appendix G).

| Model variant | Method | CoLLEGe-DefGen BERTScore F1 | CoLLEGe-DefGen ROUGE-L |
|---|---|---|---|
| Llama-3 8B | baseline | 85.8 | 17.8 |
| Llama-3 8B | +Minnow | 85.9 | 21.1 |
| Llama-3 8B Instruct | baseline | 85.9 | 19.5 |
| Llama-3 8B Instruct | +Minnow | 86.2 | 22.6 |
| Llama-2 7B | CoLLEGe* | 84.8 | 17.8 |

Table 14: Quantitative evaluation of generated definitions by comparing them with ground-truth definitions in the CoLLEGe-DefGen dataset. Definitions are generated 1-, 2-, and 3-shot and scores are averaged. All definitions are generated with greedy decoding. For models finetuned with Minnow, scores are averaged across 3 runs. *: CoLLEGe results are from Teehan et al. (2024), which is based on Llama-2 7B and slightly different data processing (see Appendix G). We do not have FLAN-T5 models here since Giulianelli et al. (2023) finetuned them to use only one usage example.
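To make the ROUGE-L columns in Tables 13 and 14 concrete, the following is a minimal sketch of an LCS-based ROUGE-L F1 score together with the averaging over 1-, 2-, and 3-shot settings described in the Table 14 caption. It is an illustration under stated assumptions, not the evaluation code used for the tables: the lowercased whitespace tokenization, the balanced F1 weighting, and the function names `lcs_length`, `rouge_l_f1`, and `mean_over_shots` are all assumptions introduced here. The tables report scores on a 0-100 scale, whereas this sketch returns values in [0, 1].

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between a generated and a ground-truth definition.

    Assumption: lowercased whitespace tokenization and a balanced F1.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_over_shots(score_by_shot):
    """Average a metric over shot settings, e.g. {1: ..., 2: ..., 3: ...}."""
    return sum(score_by_shot.values()) / len(score_by_shot)


# Definitions taken from the "this food [new-token] the patient" row of Table 16.
score = rouge_l_f1("to give strength", "restore strength")
print(round(score, 3))
```

Here the only shared token is "strength", so precision is 1/3 and recall is 1/2. A corpus-level score would average such per-definition values over the test set.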
| Example Sentence | True Definition | Minnow | Baseline | Word |
|---|---|---|---|---|
| As the hurricane neared, the residents began to [new-token] their windows to protect their homes from the impending storm. | to cover or seal windows, doors, or other openings of a building with boards, typically to protect it from damage or unauthorized entry. | to protect from harm or danger | to prepare for a hurricane by boarding up windows | board up |
| | | to make something more secure or safe by covering it with a layer of material | to secure or fasten something, especially a window, to prevent it from being damaged or destroyed | |
| The gentle hum of the air conditioner provided a [new-token] soundtrack for her midday nap. | having a calming or relieving effect, especially in terms of reducing pain or discomfort. | a sound that is not loud enough to be heard | a small, furry, brown creature that lives in trees. | soothing |
| | | a soothing, calming, or quiet sound | a wug is a word that is not yet known to the speaker, but is assumed to be a real word. | |
| In their groundbreaking research, the team of geneticists successfully deactivated the [new-token], resulting in the unexpected bloom of dormant traits within the lab mice. | a type of gene that codes for a protein, known as a repressor, which inhibits the expression of one or more genes by binding to the operator or associated silencers. | a gene that is turned off in a cell | a hypothetical new word that does not yet exist in the English language. | repressor gene |
| | | a gene or set of genes that controls the development of a particular trait or characteristic | a hypothetical word used in linguistic research to test the ability to form and use new words. | |
| She preferred the [new-token] wilderness to the stifling orderliness of city life. | not restricted or limited; free; unconfined. | not having a definite shape or form | a small, furry animal | untrammeled |
| | | a place where there are many trees, especially in a forest or a park | a mythical creature that is half-wolf and half-bear | |
| In the heart of her rustic kitchen, Grandma carefully seasoned the [new-token], her secret ingredient for the family’s cherished Sunday stew. | The chest portion of a young cow, typically used in cooking for its tender meat. | a mixture of herbs and spices used to flavor food | a mythical creature that resembles a cross between a dog and a frog. | breast of veal |
| | | a small, usually round, piece of food, especially a piece of meat or a vegetable, cut off from a larger piece and cooked separately | a type of meat | |
| The contractor recommended [new-token] for insulation due to its excellent thermal resistance and fireproofing properties. | a type of insulation material made from melted rock or industrial waste that is spun into a fibrous structure. It is used for thermal insulation, fire protection, and soundproofing in both residential and commercial buildings. | a material used to insulate a building | a unit of insulation used in construction | mineral wool |
| | | a material used to prevent heat transfer, especially in buildings | a type of insulation material | |

Table 15: Definitions generated by both the pre-trained and instruction-tuned variant of Llama-3 8B (baselines) and the models finetuned from them with Minnow on BabyLM-10M with greedy decoding, using the prompt “The word [new-token] in the above sentence(s) is defined as "” (so we can extract continuations before the closing quote as the generated definitions). Each definition is generated using the single example sentence shown and provided in context. The pre-trained/instruction-tuned variant is shown at the top/bottom of each row of example. We boldface the winner judged by GPT-4o.
Comparing the two definitions in each example by hand shows that the instruction-tuned variant generally produces better definitions than the pre-trained variant.

| Example Sentence | True Definition | Minnow | Baseline | FLAN-XL-DefInstr | Word |
|---|---|---|---|---|---|
| many people in the world have to [new-token] on $ 1 a day | support oneself | to live on something, especially food, in a way that is not very good or healthy | to struggle to make ends meet | live on | subsist |
| this food [new-token] the patient | restore strength | to make something more intense or strong | a type of food | to give strength | revitalizes |
| he is always well- [new-token] | care for one’s external appearance | well - dressed | a type of wug | in good health | groomed |
| the lawyers tried to [new-token] the credibility of the witnesses | challenge the honesty or veracity of | to make something more convincing or believable | to question the credibility of a witness | to challenge the honesty or veracity of | impeach |
| the car squeaks to a halt and she glares at him because of his [new-token] stop. | characterized by abrupt stops and starts | a sudden, sharp, high - pitched sound, especially one made by a car’s brakes or a bird’s call | a made-up word | a jerk that causes an object to move abruptly | jerky |
| try the full plate pork [new-token] : tender pork, oregano-spiked greek salad, warm puffy pita, rice, and aromatic tzatziki-topped lemon potatoes. | a greek dish of pieces of meat grilled on a skewer | a dish of meat, usually pork, served with a sweet and sour sauce, and often served with rice and vegetables | a type of dish that is a combination of pork, rice, and potatoes, typically served with a side of salad and pita bread. | a greek dish of grilled meat served in a pita . | souvlaki |
| extend the tv antenna (word is absent) | extend or stretch out to a greater or the full length | a small, usually round, piece of metal or plastic used to connect two wires together | a type of bird | raise or extend vertically | stretch |
| the red light gave the central figure increased emphasis (word is absent) | special importance or significance | a red light | a wug is a wug | special importance or significance | accent |

Table 16: Definitions generated by the instruction-tuned variant of Llama-3 8B (baseline), the Minnow model finetuned from it with greedy decoding, and FLAN-XL-DefInstr (i.e., FLAN-T5 XL +DefInstr baseline), using the prompt “The word [new-token] in the above sentence(s) is defined as "” ([new-token] can be replaced by other placeholders, as we mentioned in Section 5.4). Each definition is generated using the single example sentence shown and provided in context. The Minnow model generates reasonable definitions given the context, but they are often much longer than the ground-truth definitions, likely because the model is not fitted to this dataset. The baseline model often generates low-quality or repetitive definitions, and sometimes sticks to its prior knowledge of the pseudo-word “wug.” FLAN-XL-DefInstr generates definitions very close to the ground truth, but it sometimes appears to overfit to or memorize the data, as its definitions for ‘impeach’ and ‘accent’ (absent from the example) may suggest.

H Concepts of “Word”

The term “word” can refer to linguistic units with nuanced variations. Here, we describe the concepts of “word” in different contexts of the paper and their implications. Surprisingly, our models are somewhat robust to these variations of “word,” though future work may further improve the processing of words.

Word usage datasets In the two datasets we constructed for training and finetuning (Section 3.2 and Appendix A), a “word” means a word-form, which is instantiated as an individual token extracted from the word-level tokenization (using spaces and punctuation as boundaries). Therefore, for the same lexeme, a sentence using one of its word-forms is not considered an example of another word-form.
For instance, a sentence using other inflected forms of “ski” like “Susie likes skiing fast down the snowy mountain on her new skis” is not included in the example set of “ski.” Meanwhile, when two word-forms of the same lexeme occur in one sentence, meta-learning one of the word-forms could be easier since the other word-form may not be masked. For instance, “skis” in the sentence “I saw Susie ski fast down the snowy mountain on her new skis” could make it easier to guess the word “ski.” In our work, we focus on learning word-forms, but if we aimed to learn a lexeme, this case would reveal the identity of the lexeme we try to mask, undermining our effort to keep the learned word novel. On the other hand, a word-form that belongs to several syntactic categories is considered the same word, and its usage examples are mixed together regardless of syntactic category. Such words are rare, but they introduce syntactic uncertainties in word learning. Syntactic uncertainties are natural, but may increase the difficulty of learning.

Pseudo-words In our baselines (Section 3.4 and the FLAN-T5 models in Section 5.4) and comparison of generations (Appendix E), we replace the word to learn with a pseudo-word, like “dax” or “wug”, regardless of the word’s syntactic category and other aspects of meaning. The pseudo-word is then tokenized, usually by a subword tokenizer for LLMs (and thus may span multiple tokens). We choose the pseudo-word to be meaningless and commonly used in linguistic tests. However, a pre-trained LLM like Llama-3 may have priors about certain aspects of the pseudo-word’s meaning based on its form. One aspect of the meaning is syntax. For example, from the sentence “Susie goes skiing in the winter”, we replace “skiing” with “dax” and obtain the sentence “Susie goes dax in the winter.” The sentence has a problem: “skiing” is a gerund, but “dax” does not look like a gerund (since it does not end in “-ing”).
So the sentence could mislead an LLM like Llama-3, which can use morphological information from its subword tokenization. Another aspect of the meaning is semantics. For example, in Table 16, the baseline model sometimes sticks to its prior knowledge of the pseudo-word “wug,” as reflected in its generated definitions like “a made-up word” and “a type of bird” (“wug” referred to a bird-like creature in the Wug Test of Berko, 1958). We admit that this problem may weaken our baselines and comparison of generations. Future work should use more suitable pseudo-words, preserving the morphological inflections while removing the semantic information.

Evaluation datasets Words to be learned in the Chimera, CoLLEGe-DefGen, and Oxford datasets are lexemes, so examples of each word use (different) inflected word-forms. To ensure the placeholder consistently represents the same text, we replace only the word stem with the placeholder and retain the inflectional suffixes in the original word-forms on the Chimera and CoLLEGe-DefGen datasets. (We still replace word-forms in Oxford to make our practice consistent with previous ones.) In addition, words to be learned in the CoLLEGe-DefGen dataset also include multiword expressions or phrases, like the “categorical imperative” example in Table 4. See Appendix G for further details of preprocessing. Surprisingly, although our placeholder token represents a word-form in the BabyLM-10M dataset we constructed, Minnow models finetuned on BabyLM-10M still perform well when using the token to represent a word stem in these datasets.
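The two replacement strategies discussed in this appendix can be contrasted with a small sketch: replacing a whole word-form with a pseudo-word (as in the baselines) versus replacing only the stem with the placeholder while retaining inflectional suffixes (as described for Chimera and CoLLEGe-DefGen). This is a minimal illustration under stated assumptions, not the preprocessing code behind the reported results. The function names `replace_form` and `replace_stem`, the short `SUFFIXES` list, and the surface format in which a suffix is attached directly to the placeholder are hypothetical choices made here.

```python
import re

# Hypothetical, deliberately short list of inflectional suffixes (assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def replace_form(sentence, word_form, replacement):
    """Replace whole occurrences of one word-form, e.g. with a pseudo-word."""
    pattern = re.compile(rf"\b{re.escape(word_form)}\b", flags=re.IGNORECASE)
    return pattern.sub(lambda m: replacement, sentence)


def replace_stem(sentence, stem, placeholder="[new-token]"):
    """Replace only the stem and keep any inflectional suffix that follows it.

    The lookahead requires the stem to be followed by an optional suffix and
    then a word boundary, so unrelated words such as "skip" are left alone.
    """
    suffix_alternatives = "|".join(SUFFIXES)
    pattern = re.compile(
        rf"\b{re.escape(stem)}(?=(?:{suffix_alternatives})?\b)",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: placeholder, sentence)


# Pseudo-word baseline: the gerund ending is lost.
print(replace_form("Susie goes skiing in the winter", "skiing", "dax"))

# Stem replacement: the "-ing" suffix stays visible next to the placeholder.
print(replace_stem("Susie goes skiing in the winter", "ski"))
```

With the second strategy, both “ski” and “skis” in a single sentence map to the same placeholder, which avoids the leakage described in the word usage datasets paragraph. Real data would need a morphological analyzer rather than a fixed suffix list, since irregular forms and stem changes are not handled here.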

---