Augmenting Language Models with Long-Term Memory

Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, Furu Wei
University of California, Santa Barbara
Microsoft Research
[email protected], {lidong1, haocheng}@microsoft.com
Abstract

Existing large language models (LLMs) can only process fixed-size inputs due to the input length limit, preventing them from utilizing rich long-context information from past inputs. To address this, we propose a framework, Language Models Augmented with Long-Term Memory (LONGMEM), which enables LLMs to memorize long history. We design a novel decoupled network architecture with the original backbone LLM frozen as a memory encoder and an adaptive residual side-network as a memory retriever and reader. Such a decoupled memory design can easily cache and update long-term past contexts for memory retrieval without suffering from memory staleness. Enhanced with memory-augmented adaptation training, LONGMEM can thus memorize long past context and use long-term memory for language modeling. The proposed memory retrieval module can handle unlimited-length context in its memory bank to benefit various downstream tasks. Typically, LONGMEM can enlarge the long-form memory to 65k tokens and thus cache many-shot extra demonstration examples as long-form memory for in-context learning. Experiments show that our method outperforms strong long-context models on ChapterBreak, a challenging long-context modeling benchmark, and achieves remarkable improvements on memory-augmented in-context learning over LLMs. The results demonstrate that the proposed method is effective in helping language models to memorize and utilize long-form contents. Our code is open-sourced at https://aka.ms/LongMem.
1 Introduction
Large language models (LLMs) have revolutionized natural language processing with great successes in advancing the state-of-the-art on various understanding and generation tasks [DCLT19, RWC+19, LOG+19, YDY+19, BMR+20, RSR+20]. Most LLMs benefit from self-supervised training over large corpora by harvesting knowledge from fixed-size local context, showing emergent abilities, e.g., zero-shot prompting [RWC+19], in-context learning [BMR+20], and Chain-of-Thought (CoT) reasoning [WWS+22]. Nevertheless, the input length limit of existing LLMs prevents them from generalizing to real-world scenarios where the capability of processing long-form information beyond a fixed-size session is critical, e.g., long-horizon planning.
To address the length limit issue, the most straightforward method is to simply scale up the input context length. For instance, GPT-3 [BMR+20] increases the input length from the 1k tokens of GPT-2 [RWC+19] to 2k tokens to capture better long-range dependencies. However, this approach typically incurs computation-intensive training from scratch, and in-context dense attention is still heavily constrained by the quadratic computation complexity of Transformer self-attention [VSP+17]. Another
recent line of work [BPC20, ZGD+20] instead focuses on developing in-context sparse attention to avoid the quadratic cost of self-attention, which still largely requires training from scratch. In contrast, the prominent work, Memorizing Transformer (MemTRM) [WRHS22], approximates in-context sparse attention via dense attention over both in-context tokens and memorized tokens retrieved from a non-differentiable memory for Transformers.
Figure 1: Overview of the memory caching and retrieval flow of LONGMEM. The long text sequence is split into fixed-length segments; each segment is forwarded through the frozen large language model, and the attention key and value vectors of its m-th layer are cached into the long-term memory bank. For future inputs, via attention query-key based retrieval, the top-k attention key-value pairs of the long-term memory are retrieved and fused into language modeling.
Thus, MemTRM scales up the resulting language model to handle up to 65k tokens and achieves substantial perplexity gains in modeling full-length books or long papers. However, MemTRM faces the memory staleness challenge during training due to its coupled memory design, which uses a single model for encoding memory and fusing memory for language modeling. In other words, as the model parameters are updated, cached older representations in memory may exhibit distributional shifts from those produced by the latest model, thereby limiting the effectiveness of the memory augmentation.
In this paper, we propose a framework for Language Models Augmented with Long-Term Memory (LONGMEM), which enables language models to cache long-form previous context or knowledge into a non-differentiable memory bank and to further take advantage of it via a decoupled memory module, addressing the memory staleness problem. To achieve decoupled memory, we design a novel residual side-network (SideNet). Paired attention keys and values of the previous context are extracted using a frozen backbone LLM into the memory bank. In the memory-augmented layer of the SideNet, the attention query generated for the current input is used to retrieve cached (key, value) pairs of previous contexts from the memory, and the corresponding memory augmentations are then fused into the learned hidden states via a joint-attention mechanism. Furthermore, newly designed cross-network residual connections between the SideNet and the frozen backbone LLM enable better knowledge transfer from the pretrained backbone LLM. By continually training the residual SideNet to retrieve and fuse the memory-augmented long context, the pre-trained LLM can be adapted to leverage long-contextual memory for improved modeling. The detailed memory caching, retrieval, and fusion process is illustrated in Figure 1.
Our decoupled memory design leads to two main benefits. First, our proposed architecture decouples the process of encoding previous inputs into memory from the process of memory retrieval and fusion, via the separate frozen backbone LLM and SideNet. In this way, the backbone LLM only works as the long-context knowledge encoder, while the residual SideNet works as the memory retriever and reader, which effectively resolves the issue of memory staleness. Second, directly adapting the entire LLM with memory augmentations is computationally inefficient and also suffers from catastrophic forgetting. As the backbone LLM is frozen during the efficient memory-augmented adaptation stage, LONGMEM can not only tap into the pretrained knowledge but also avoid catastrophic forgetting.
LONGMEM is capable of taking various types of long-form text and knowledge into the memory bank based on downstream tasks. Here, we consider two representative cases: language modeling with full-length book contexts, and memory-augmented in-context learning with thousands of task-relevant demonstration examples. Specifically, we evaluate the effectiveness of the proposed LONGMEM on various long-text language modeling tasks, as well as memory-augmented in-context learning for language understanding. Experimental results demonstrate that our model consistently outperforms the strong baselines in terms of long-text modeling and in-context learning abilities.
Figure 2: Overview of the LONGMEM architecture. "MemAug" represents the Memory-Augmented Layer.
Our method substantially improves the LLM's long-context language modeling capabilities, with perplexity reductions of 1.38 to 1.62 over different length splits of the Gutenberg-2022 corpus. Remarkably, our model achieves the state-of-the-art performance of 40.5% identification accuracy on ChapterBreak, a challenging long-context modeling benchmark, significantly surpassing existing strong x-former baselines. Lastly, with 2k demonstration examples in memory, LONGMEM shows pronounced in-context learning improvements on popular NLU tasks, compared with MemTRM and non-memory-augmented baselines.
2 Methods
To enable LLMs to harvest relevant information from the past long context in memory, we propose to
augment the frozen backbone LLM with a decoupled memory module. To fuse the memory context
information, we design a novel lightweight residual SideNet, which can be continually trained in
an efficient way. In the following, we first discuss the problem formulation of language modeling
with memory augmentations. Then, we formally introduce our efficient residual SideNet for adapting
the frozen pretrained LLM to jointly attend over local input context and retrieved memory context.
Lastly, we describe how past memory is encoded, stored, recalled, and fused for language modeling.
2.1 Language Models Augmented with Long-Term Memory
Here, we focus on the high-level problem setup and defer more component details to later sections. Given its wide adoption for pretrained LLMs, our LONGMEM model is built on the Transformer architecture [VSP+17]. LONGMEM has three key components: the frozen backbone LLM, the SideNet, and the Cached Memory Bank. As most existing pretrained LLMs can only take a fixed-size input, only the segment of a long sequence (e.g., a book) that fits within the length limit is treated as the current input, as in most existing autoregressive language models. The previous segments that cannot fit are treated as previous inputs, which are used for memory augmentations. To tap into the learned knowledge of the pretrained LLM, both previous and current inputs are encoded using the frozen backbone LLM, but different representations are extracted. For previous inputs, the key-value pairs from the Transformer self-attention at the $m$-th layer are stored in the Cached Memory Bank, whereas for the current inputs the hidden states from each LLM decoder layer are retained and transferred to the SideNet. For each current input token, the top relevant key-value vector pairs are retrieved as memory augmentations for language modeling. The SideNet module can be viewed as an efficient adaptation model that is trained to fuse the current input context and the relevant cached previous contexts in the decoupled memory.
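To make the decoupled design concrete, the following minimal sketch (our own pseudocode-style Python, not the released implementation; `backbone`, `sidenet`, and `memory_bank` are illustrative stand-ins) summarizes how the three components interact for each incoming segment:

```python
# Illustrative sketch of the LONGMEM flow for one segment.
# `backbone`, `sidenet`, and `memory_bank` are hypothetical objects standing in
# for the frozen LLM, the trainable SideNet, and the cached memory bank.

import torch


@torch.no_grad()
def encode_with_backbone(backbone, segment_ids):
    """Frozen backbone pass: returns all hidden states and the m-th layer's
    self-attention keys/values for caching (hypothetical call signature)."""
    hidden_states, kv_at_layer_m = backbone(segment_ids, return_kv_layer=True)
    return hidden_states, kv_at_layer_m


def longmem_step(backbone, sidenet, memory_bank, segment_ids):
    # 1) The frozen backbone encodes the current segment (no gradients).
    hidden_states, kv_at_layer_m = encode_with_backbone(backbone, segment_ids)

    # 2) The SideNet consumes the backbone hidden states; at its
    #    memory-augmented layer it retrieves top-k cached key/value pairs
    #    from the memory bank and fuses them into the representation.
    logits = sidenet(hidden_states, memory_bank)

    # 3) The current segment's keys/values are appended to the memory bank so
    #    that *future* segments can attend to them (causal at segment level).
    memory_bank.append(kv_at_layer_m)
    return logits
```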
Formally, for a fixed-size input text sequence $\{x_i\}_{i=1}^{|x|}$ (the current input), LONGMEM first performs a forward pass using the backbone LLM (marked in blue in Figure 2) without any gradient calculation. The embedding layer of the backbone LLM first encodes the input $\{x_i\}_{i=1}^{|x|}$ into the embedding space and outputs the initial hidden states $H^{0}_{\text{LLM}} \in \mathbb{R}^{|x| \times E}$, where $E$ is the hidden dimension. Then each successive Transformer decoder layer of the frozen backbone LLM computes new hidden states using the hidden states from the previous layer, $H^{l}_{\text{LLM}} = f_{\theta^{l}_{\text{LLM}}}(H^{l-1}_{\text{LLM}})$, $l \in [1, L]$, where $L$ is the total number of layers of the backbone LLM. During the forward pass with the backbone LLM for all previous inputs, the key-value pairs used for self-attention at the $m$-th Transformer decoder layer are stored in the Cached Memory Bank (marked in orange in the upper-left corner of Figure 2), which are later recalled as memory augmentations for future inputs.
The Cached Memory Bank is a cached head-wise vector queue $Z_k, Z_v \in \mathbb{R}^{H \times M \times d}$, which maintains the attention key-value pairs of the latest $M$ previous inputs, $\widetilde{K}, \widetilde{V} \in \mathbb{R}^{H \times |x| \times d}$, where $H$ and $d$ denote the number of attention heads and the per-head dimension, respectively. After memory retrieval and fusion (§2.3), the memory bank removes the key-value pairs of the oldest sequences and appends those of the current sequences to the cached vector bank. Such an update mechanism ensures language modeling causality at the sequence level and enables the memory bank to always keep records of the nearest previous context for the current inputs.
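A minimal sketch of this FIFO update, assuming head-wise key/value tensors of shape (H, segment length, d), is shown below (illustrative code, not the authors' implementation):

```python
import torch


class CachedMemoryBank:
    """Minimal sketch of the head-wise FIFO memory bank described above.

    Keys/values are stored per attention head with shape (H, M, d), where M is
    the memory size in tokens. The oldest tokens are evicted when new segments
    are appended, so the bank always holds the nearest previous context.
    """

    def __init__(self, num_heads: int, memory_size: int, head_dim: int):
        self.M = memory_size
        self.keys = torch.empty(num_heads, 0, head_dim)
        self.values = torch.empty(num_heads, 0, head_dim)

    def append(self, seg_keys: torch.Tensor, seg_values: torch.Tensor):
        """seg_keys/seg_values: (H, seg_len, d) from the backbone's m-th layer."""
        self.keys = torch.cat([self.keys, seg_keys], dim=1)[:, -self.M:]
        self.values = torch.cat([self.values, seg_values], dim=1)[:, -self.M:]


# Usage: after each segment's frozen forward pass, append its m-th-layer
# key/value pairs; retrieval (Section 2.3) then reads bank.keys / bank.values.
bank = CachedMemoryBank(num_heads=16, memory_size=65536, head_dim=64)
bank.append(torch.randn(16, 1024, 64), torch.randn(16, 1024, 64))
```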
After the forward pass with the backbone LLM, the SideNet module takes all current-input hidden states from the backbone LLM, $\{H^{l}_{\text{LLM}}\}_{l=1}^{L}$, and the past key-value pairs in the Cached Memory Bank to compute memory-augmented representations. Specifically, the SideNet of LONGMEM consists of $(\tilde{L}-1)$ normal Transformer decoder layers and one special memory-augmented decoder layer. For efficiency, we mainly consider the case where the number of SideNet layers $\tilde{L}$ is smaller than that of the backbone LLM, i.e., $\tilde{L} < L$. The SideNet encodes $H^{0}$ into a memory-augmented contextual representation via the $(\tilde{L}-1)$ normal Transformer decoder layers and the special memory-augmented layer.
The memory-augmented layer is an extension of the vanilla Transformer decoder layer that takes a memory-augmented input, including both the top relevant key-value pairs in memory and the hidden states from the current input. Here, the cached key-value pairs are recalled using a token-based memory retrieval module (§2.3). For each current input token $x_i$, the memory retrieval module $s_{rt}(\cdot)$ retrieves the top-$K$ relevant key-value pairs in the memory bank, $\{\widetilde{k}_{ij}, \widetilde{v}_{ij}\}_{j=1}^{K} = s_{rt}(x_i)$. Then the SideNet computes the output using the memory-augmented input, $H^{m_s}_{\text{Side}} = f_{\Theta_{\text{Mem}}}(H^{m_s-1}_{\text{Side}}, \{\{\widetilde{k}_{ij}, \widetilde{v}_{ij}\}_{j=1}^{K}\}_{i=1}^{|x|})$, where $m_s$ is the index of the layer at which we inject the memory-augmentation layer.
Finally, the token probability is computed using the last SideNet hidden states, $P(x_i \mid x_1, \cdots, x_{i-1}) = \mathrm{softmax}(W H^{\tilde{L}})$, where $W$ is the frozen output embedding weight shared by both the backbone LLM and the SideNet. We perform memory-augmented adaptation training for LONGMEM to utilize the decoupled memory. Following generative unsupervised pre-training [RNSS18], the training objective of LONGMEM is the standard left-to-right language modeling objective, which maximizes the likelihood of the next token given the left context:

$$\max \sum_{x \in \mathcal{D}} \sum_{i=1}^{|x|} \log P(x_i \mid x_1, \cdots, x_{i-1}),$$

where $x$ is a randomly sampled sentence from the pre-training text corpus $\mathcal{D}$.
2.2 Residual SideNet
SideNet Architecture and Initialization. We again implement the SideNet based on the Transformer [VSP+17]. The number of decoder layers $\tilde{L}$ in the SideNet is equal to the number of layers $L$ in the backbone LLM divided by a reduction factor (a layer reduction factor of 2 throughout this work, i.e., $L = 2\tilde{L}$). The weights of each SideNet decoder layer are initialized from the pre-trained backbone decoder layer at the corresponding depth, $\Theta^{l}_{\text{Side}} = \Theta^{2l}_{\text{LLM}}$. As illustrated in Figure 2, the SideNet takes the output of the backbone LLM's embedding layer and reuses the language modeling head of the backbone LLM, which is also frozen during the continual adaptation stage. During the memory-augmented adaptation stage, all other parameters of the SideNet are updated based on the training signal. In this way, the lightweight SideNet achieves fast convergence with knowledge transferred from the pre-trained parameters.
Cross-Network Residual Connections. To tap into knowledge from the pretrained backbone LLM, we propose cross-network residual connections for fusing representations from the backbone LLM into the SideNet. Specifically, we add the difference between the output hidden states at the $2l$-th and $(2l-2)$-th layers of the backbone LLM as a residual connection to the output hidden states at the $l$-th layer of the SideNet. Then, the input to the next $(l+1)$-th SideNet layer is the sum of the hidden state forwarded through the previous layer, $f_{\Theta^{l}_{\text{Side}}}(H^{l-1}_{\text{Side}})$, and the cross-network residual connection of the hidden-state difference from the backbone LLM:

$$H^{l}_{\text{Side}} = f_{\Theta^{l}_{\text{Side}}}(H^{l-1}_{\text{Side}}) + (H^{2l}_{\text{LLM}} - H^{2l-2}_{\text{LLM}}), \quad l \in [1, \tilde{L}], \qquad (1)$$

where $H^{0}$ is the output of the embedding layer. It is worth noting that the residual connections after the self-attention and feed-forward network within a decoder layer [VSP+17] are performed as usual inside $f_{\Theta^{l}_{\text{Side}}}(H^{l-1}_{\text{Side}})$, in parallel with the proposed cross-network residual connections.
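For concreteness, Eq. (1) and the depth-matched initialization can be sketched as follows (an illustrative PyTorch sketch under our reading of the notation; the memory-augmented layer is omitted for brevity):

```python
import torch
from torch import nn


def sidenet_forward(side_layers: nn.ModuleList, llm_hiddens: list, h0: torch.Tensor):
    """Sketch of Eq. (1): SideNet layer l adds the difference of the backbone's
    2l-th and (2l-2)-th hidden states as a cross-network residual.

    llm_hiddens[j] is the backbone hidden state after layer j (index 0 is the
    embedding output, so the list has L + 1 entries); side_layers has L/2 layers.
    """
    h = h0  # the SideNet also starts from the backbone embedding output
    for l, layer in enumerate(side_layers, start=1):
        cross_residual = llm_hiddens[2 * l] - llm_hiddens[2 * l - 2]
        h = layer(h) + cross_residual
    return h


def init_sidenet(side_layers: nn.ModuleList, llm_layers: nn.ModuleList):
    """Initialization sketch: SideNet layer l copies the weights of backbone
    layer 2l (0-indexed as 2l - 1); a hedged reading of the depth-matched init."""
    for l, side_layer in enumerate(side_layers, start=1):
        side_layer.load_state_dict(llm_layers[2 * l - 1].state_dict())
```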
2.3 Memory Retrieval and Fusion
The long-term memory capability of LONGMEM is achieved via a memory-augmentation module for retrieval and fusion.

Token-to-Chunk Memory Retrieval. Instead of performing token-to-token retrieval, we focus on token-to-chunk retrieval for acceleration and integrity. A text chunk is an n-gram of chunk-size csz contiguous tokens, and the memory bank stores cached key-value pairs at the level of token chunks. We divide the memory bank into M/csz attention key-value paired chunks and use the vector mean-pooled over the chunk-size dimension as the key vector for retrieval. Then we retrieve the top-(K/csz) attention key-value chunks w.r.t. the dot product between the attention query of the current input token and the mean-pooled attention key of a candidate chunk. Finally, we squeeze the chunk-size dimension of the retrieved key-value paired chunks and flatten them into $K$ token-level key-value pairs $\{\widetilde{K}_j, \widetilde{V}_j\}_{j=1}^{K}$. Adopting token-to-chunk retrieval reduces the size of the retrieval index and accelerates the process, while the retrieval accuracy can also be improved, as observed in [LGW+23] and [BMH+21]. The hyperparameter chunk-size csz controls the granularity of the retrieved contexts and can be adjusted empirically based on the downstream task. For instance, in-context learning requires retrieving fine-grained label tokens from demonstration examples cached in memory, where a smaller csz is helpful.
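The retrieval step can be sketched for a single attention head as follows (illustrative PyTorch code; the exhaustive dot-product search here is a simplification of the faiss-based index described in Section 3.1):

```python
import torch


def token_to_chunk_retrieve(query: torch.Tensor, mem_keys: torch.Tensor,
                            mem_values: torch.Tensor, csz: int, k: int):
    """Sketch of token-to-chunk retrieval for a single head.

    query:      (|x|, d)  attention queries of the current input tokens
    mem_keys:   (M, d)    cached attention keys (M divisible by csz)
    mem_values: (M, d)    cached attention values
    Returns per-token retrieved keys/values of shape (|x|, K, d), K = k * csz.
    """
    M, d = mem_keys.shape
    # 1) Mean-pool keys over each chunk of csz contiguous tokens -> (M/csz, d).
    chunk_keys = mem_keys.view(M // csz, csz, d).mean(dim=1)

    # 2) Dot-product search: top-(K/csz) chunks per query token.
    scores = query @ chunk_keys.t()                     # (|x|, M/csz)
    top_chunks = scores.topk(k, dim=-1).indices         # (|x|, k)

    # 3) Expand chunk indices back to token indices and gather keys/values,
    #    flattening the chunk dimension into K = k * csz token-level pairs.
    token_idx = (top_chunks.unsqueeze(-1) * csz +
                 torch.arange(csz)).reshape(query.size(0), -1)   # (|x|, K)
    return mem_keys[token_idx], mem_values[token_idx]


# Example: retrieve K = 16 chunks * 4 tokens = 64 key-value pairs per token.
q = torch.randn(8, 64)
ks, vs = torch.randn(1024, 64), torch.randn(1024, 64)
rk, rv = token_to_chunk_retrieve(q, ks, vs, csz=4, k=16)
print(rk.shape)  # torch.Size([8, 64, 64])
```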
Memory Fusion. Memory fusion is performed within the special memory-augmented layer. As the conventional Transformer decoder layer uses multi-head self-attention [VSP+17], we follow [WRHS22] to extend it to a joint-attention mechanism and propose a long-term memory fusion process that enables each token to attend to both local contexts and retrieved memory contexts. Given the head-wise hidden-state output from the previous layer, $H^{l-1} \in \mathbb{R}^{|x| \times d}$, and the corresponding retrieved attention key-value pairs $\{\widetilde{K}_i, \widetilde{V}_i\}_{i=1}^{|x|} \in \mathbb{R}^{|x| \times K \times d}$, the output hidden state of the $l$-th memory-augmented layer, $H^{l}$, is computed as:

$$A = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V, \qquad M = \mathrm{Concat}\left\{\mathrm{softmax}\left(\frac{Q_i \widetilde{K}_i^{\top}}{\sqrt{d}}\right) \widetilde{V}_i\right\}_{i=1}^{|x|}, \qquad (2)$$

$$H^{l} = \mathrm{sigmoid}(g) \cdot A + (1 - \mathrm{sigmoid}(g)) \cdot M, \qquad (3)$$

where $Q, K, V, A, M \in \mathbb{R}^{|x| \times d}$, $K$ (as a count) is the number of retrieved attention key-value pairs in cached memory for each token, and $g$ is a trainable head-wise gating vector. The hidden-state output from the previous layer, $H^{l-1}$, is linearly projected into attention queries, keys, and values $Q, K, V$ separately via three projection matrices $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d \times d}$. It is worth noting that the retrieved attention key-value pairs in cached memory are distinct for each token.
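A single-head sketch of Eqs. (2)-(3) is given below (illustrative code; head-wise batching, weight sharing with the backbone, and other implementation details are simplified):

```python
import torch


def memory_fusion_head(h_prev, mem_k, mem_v, w_q, w_k, w_v, gate, causal_mask):
    """Single-head sketch of the joint attention in Eqs. (2)-(3).

    h_prev: (|x|, d)         hidden states from the previous SideNet layer
    mem_k, mem_v: (|x|, K, d) per-token retrieved keys/values from memory
    w_q, w_k, w_v: (d, d)    projection matrices; gate: head-wise gate g
    causal_mask: (|x|, |x|)  boolean mask, True where attention is disallowed
    """
    d = h_prev.size(-1)
    q, k, v = h_prev @ w_q, h_prev @ w_k, h_prev @ w_v

    # Local self-attention over the current segment (Eq. 2, term A).
    local_scores = (q @ k.t()) / d ** 0.5
    local_scores = local_scores.masked_fill(causal_mask, float("-inf"))
    a = local_scores.softmax(dim=-1) @ v                      # (|x|, d)

    # Attention over each token's own retrieved memory pairs (Eq. 2, term M).
    mem_scores = torch.einsum("id,ikd->ik", q, mem_k) / d ** 0.5
    m = torch.einsum("ik,ikd->id", mem_scores.softmax(dim=-1), mem_v)

    # Gated fusion of local and memory attention outputs (Eq. 3).
    g = torch.sigmoid(gate)
    return g * a + (1 - g) * m
```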
3 Experiments
We evaluate our proposed LONGMEM model on different tasks based on the demanded in-memory long contexts: (a) long-text language modeling and language understanding, loading the past long context into cached memory; and (b) infinite-length in-context learning, loading a large number of demonstration examples into cached memory.
3.1 Training Setup
Batchfying the training corpora. The conventional batchfying process for large corpora truncates the whole corpora into consecutive fixed-length text segments without padding and shuffles all segments to construct mini-batches [RWC+19]. In contrast, LONGMEM must disable global shuffling and ensure global causality at the segment level. First, we divide all long documents in the training corpora into a batch-size number of document groups with equivalent total length and then perform document-level shuffling within each group. Then, we concatenate the shuffled documents within each group and truncate them into ordered segments. To ensure that two consecutive segments of one long document are distributed in two consecutive input batches after batchfying, we take one segment from each of the batch-size document groups at the same inner-group index, so a mini-batch with batch-size segments is constructed from exactly batch-size document groups. In this way, as training iterates, the cached attention key-value pairs in the memory bank are exactly the previous context of the current inputs within the same document. The batchfying process is illustrated in Figure 3 and sketched in code below.
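The following is a minimal Python sketch of this scheme (our own illustrative code, not the released implementation; simple round-robin grouping stands in for the length-balanced document grouping):

```python
import random


def batchfy(documents, batch_size, seg_len, pad_id=0):
    """Minimal sketch of the batchfying scheme described above.

    `documents` is a list of token-id lists. Documents are split into
    `batch_size` groups; each group is concatenated and cut into ordered
    segments; batch t then takes the t-th segment of every group, so two
    consecutive segments of the same document land in consecutive batches.
    """
    # 1) Group documents (round-robin here, standing in for length-balanced
    #    grouping) and shuffle documents *within* each group only.
    groups = [documents[i::batch_size] for i in range(batch_size)]
    for g in groups:
        random.shuffle(g)

    # 2) Concatenate each group and truncate it into ordered fixed-length segments.
    def to_segments(group):
        stream = [tok for doc in group for tok in doc]
        return [stream[i:i + seg_len] for i in range(0, len(stream), seg_len)]

    segmented = [to_segments(g) for g in groups]

    # 3) Batch t = the t-th segment from every group (pad the tail if needed).
    num_batches = min(len(s) for s in segmented)
    for t in range(num_batches):
        yield [s[t] + [pad_id] * (seg_len - len(s[t])) for s in segmented]
```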
Figure 3: Batchfying the large text corpora into batches such that consecutive segments within each document are distributed into consecutive batches.
Training Corpus and Hyperparameters. We sample a subset of the Pile [GBB+20] as the training corpus, including the BookCorpus2, Books3, OpenWebText2, Stack Exchange, Wikipedia, Gutenberg (PG-19), NIH ExPorter, and Pile-CC datasets. We reproduce GPT-2 (407M parameters) as the pre-trained backbone LLM with Alibi [PSL21] position embeddings, because the original GPT-2 [RWC+19] adopts absolute position embeddings, which have been found to perform poorly in enabling LLMs to learn long-distance dependencies [DYY+19]. The backbone LLM uses an $L=24$, $H=16$, $d=64$ architecture, and the SideNet uses an $\tilde{L}=12$, $H=16$, $d=64$ architecture. Memory-augmented adaptation training iterates over 26B tokens, with a global batch size of 256 and a sequence length of 1024. The chunk-size csz is 4 tokens and the memory size M is 65k key-value pairs of tokens. For each token, we retrieve K=64 attention key-value pairs for augmentation, i.e., K/csz=16 text chunks. The memory-augmentation layer is the 9th layer of the SideNet. The attention keys and values from the 18th layer of the backbone LLM are cached into memory and used for future retrieval. Other training details are presented in Appendix B.
Memory Retrieval Module. The fixed memory size of the cached memory bank on one GPU is 65536 key-value pairs of tokens. We let each GPU construct and update its own memory retrieval module for efficiency. For the implementation of efficient token-to-chunk retrieval, we use the faiss [JDJ21] toolkit to construct an exact-search index on GPU that stores the mean-pooled attention keys of text chunks and performs efficient retrieval. The faiss index maintains a fixed set of M/csz keys and provides efficient exact search w.r.t. the inner product. Retrieval takes about 15 ms per 1k tokens, which is 55% of the time cost of the backbone LLM forward pass. The exact-search index can easily be replaced with an approximate-search index to gain further retrieval efficiency.
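A minimal sketch of such a retrieval index, using only standard faiss calls (array contents and the head dimension here are illustrative):

```python
import numpy as np
import faiss  # pip install faiss-gpu (or faiss-cpu)

d = 64            # per-head key dimension, used here for illustration
csz, M = 4, 65536 # chunk size and memory size in tokens

# Exact inner-product index over the M/csz mean-pooled chunk keys.
index = faiss.IndexFlatIP(d)

# Optionally move the index to GPU, as done for efficiency in the paper:
# res = faiss.StandardGpuResources()
# index = faiss.index_cpu_to_gpu(res, 0, index)

chunk_keys = np.random.rand(M // csz, d).astype("float32")
index.add(chunk_keys)                                   # populate with cached chunk keys

queries = np.random.rand(1024, d).astype("float32")     # per-token attention queries
scores, chunk_ids = index.search(queries, 16)           # top-(K/csz)=16 chunks per token
print(chunk_ids.shape)                                  # (1024, 16)
```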
Baselines. In addition to the pre-trained GPT-2* baseline, we reproduce the Memorizing Transformer (MemTRM) [WRHS22] as another memory-augmented adaptation baseline. MemTRM can be easily adapted to tune a pre-trained LLM to use external memory. We insert the kNN-augmented layer proposed by MemTRM at the same 18th layer of the LLM decoder. The MemTRM baseline is also trained on the same number of tokens under the same hyperparameter setting.
3.2 Long-Context Language Modeling
Long-context language modeling can easily benefit from the augmented decoupled memory of past long contexts, in which the knowledge stored in the retrieved attention key-values plays a useful role in providing significant background and contextual information that helps models perform better on long-context language modeling.
| Dataset      | PG-22 S1 | PG-22 S2 | PG-22 S3  | PG-22 S4 | PG-22 S5 | ArXiv |
|--------------|----------|----------|-----------|----------|----------|-------|
| Len. Range   | 5K-10K   | 10K-100K | 100K-500K | 500K-1M  | >1M      | <60K  |
| #Documents   | 500      | 100      | 30        | 8        | 1        | 100   |
| Avg. #tokens | 7.6K     | 47.6K    | 140K      | 640K     | 1.2M     | 15.4K |

Table 1: Dataset statistics of the five length-based splits of PG-22 and of ArXiv.
| Model   | In-Context Len. | In-Memory Len. | PG-22 5K-10K | PG-22 10K-100K | PG-22 100K-500K | PG-22 500K-1M | PG-22 >1M | ArXiv |
|---------|-----------------|----------------|--------------|----------------|-----------------|---------------|-----------|-------|
| GPT-2*  | 1k              | N/A            | 22.78        | 24.39          | 24.12           | 24.97         | 18.07     | 11.05 |
| MemTRM  | 1k              | 65k            | 21.77        | 23.56          | 23.23           | 24.16         | 17.39     | 10.81 |
| LONGMEM | 1k              | 65k            | 21.29        | 23.01          | 22.55           | 23.35         | 16.71     | 10.05 |

Table 2: Evaluation results on long-context language modeling datasets. We report token-level perplexity (PPL; lower is better) on all datasets.
For instance, when modeling a long book accurately, knowledge of the previous background and character relationships can help model the subsequent story.
Evaluation Setting. We first compare LONGMEM and the baselines on three long-context modeling datasets: Project Gutenberg 2020-2022 (PG-22), ArXiv, and ChapterBreak. The majority of the books or papers in these datasets are at least 16k tokens long. All listed datasets are evaluated in a zero-shot manner without any task-specific tuning. The detailed evaluation settings for the three datasets are as follows:
Project Gutenberg 2020-2022 Language Modeling Dataset. We crawled and cleaned the books published between 2020 and 2022 in the Project Gutenberg Library (https://www.gutenberg.org/) to build a completely new long-text modeling dataset, named PG-22. It is highly differentiated from our training subset PG-19 in domains and writing styles, because the books in PG-19 [RPJL19] were published before 1919. We provide different validation splits of PG-22 based on length range; the data statistics are presented in Table 1.
ArXiv Dataset. The ArXiv dataset contains papers in the areas of Mathematics, Computer Science, and Physics. We select a validation split of the ArXiv paper subset of the Pile corpus [GBB+20]. The ArXiv subset of the Pile is excluded from our training data and is an out-of-distribution dataset. We report token-level language modeling perplexity on the long-context language modeling benchmarks PG-22 and ArXiv.
ChapterBreak Benchmark. ChapterBreak is proposed in [STI22] as a challenging suffix identification dataset that requires LLMs to distinguish the beginning of the ground-truth next chapter from a set of hard negative segments sampled from the same book, given the long context of previous chapters. ChapterBreak requires processing the global long context to comprehend and identify the correct suffix. [STI22] demonstrated that even state-of-the-art x-formers for long-text processing fail to effectively leverage long-range context to perform well on ChapterBreak. We select the Archive of Our Own (AO3) subset of ChapterBreak, which contains fan fiction extracted from AO3. ChapterBreak provides 8 splits based on prefix lengths from 0.5k to 8k tokens to fit the length limits of different models; the 4k, 6k, and 8k prefix splits are selected for evaluation. For LLMs that cannot process more than 4k tokens, we discard the front of the prefix to fit the maximum input length. For the MemTRM and LONGMEM models, we first load the given 4k/6k/8k prefix contexts into the cached memory and then perform the scoring. We use perplexity as the scorer for each candidate suffix segment in a zero-shot evaluation manner, and the suffix segment with the lowest perplexity is selected as the prediction. Suffix identification accuracy is used as the evaluation metric.
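The suffix-scoring procedure can be sketched as follows (a generic HuggingFace-style sketch in which GPT-2 stands in for the evaluated models; loading the prefix into LONGMEM's cached memory is abstracted away, and the function names are ours):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()


@torch.no_grad()
def suffix_perplexity(context: str, suffix: str) -> float:
    suf_ids = tokenizer(suffix, return_tensors="pt").input_ids
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    # Keep only the most recent prefix tokens that fit the 1k context window,
    # mirroring "we abandon the front prefix" for short-context LLMs.
    ctx_ids = ctx_ids[:, -(1024 - suf_ids.size(1)):]
    ids = torch.cat([ctx_ids, suf_ids], dim=-1)
    labels = ids.clone()
    labels[:, : ctx_ids.size(1)] = -100       # score only the suffix tokens
    loss = model(ids, labels=labels).loss     # mean cross-entropy over the suffix
    return torch.exp(loss).item()


def identify_suffix(context: str, candidates: list) -> int:
    """Return the index of the candidate suffix with the lowest perplexity."""
    scores = [suffix_perplexity(context, c) for c in candidates]
    return min(range(len(scores)), key=scores.__getitem__)
```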
Results. The main results on the evaluated long-context datasets are summarized in Table 2. The proposed LONGMEM model significantly outperforms all considered baselines on the long-text language modeling datasets, with improvements of 1.38 to 1.62 perplexity on different length splits of PG-22 and 1.0 perplexity on ArXiv.
| Model              | #Params | In-Context Len. | In-Memory Len. | ctx-4k | ctx-6k | ctx-8k |
|--------------------|---------|-----------------|----------------|--------|--------|--------|
| GPT-2-XL† [RWC+19] | 1.5B    | 1K              | N/A            | 24%    | 24%    | 24%    |
| GPT-3† [BMR+20]    | 175B    | 2K              | N/A            | 28%    | 28%    | 28%    |
| LocalTRM† [RSVG21] | 516M    | 8K              | N/A            | 24%    | 24%    | 24%    |
| RoutTRM† [RSVG21]  | 490M    | 8K              | N/A            | 25%    | 24%    | 24%    |
| Bigbird† [ZGD+20]  | 128M    | 4K              | N/A            | 26%    | 26%    | 26%    |
| GPT-2*             | 407M    | 1K              | N/A            | 18.4%  | 18.4%  | 18.4%  |
| MemTRM             | 407M    | 1K              | 4K/6K/8K       | 28.3%  | 28.7%  | 28.7%  |
| LONGMEM            | 558M    | 1K              | 4K/6K/8K       | 37.7%  | 39.4%  | 40.5%  |

Table 3: Zero-shot suffix identification accuracy on the AO3 subset of ChapterBreak. Baselines marked with † are directly cited from [STI22]. MemTRM and LONGMEM load the given 4k/6k/8k prefix contexts into cached memory, while the local input context length is still 1k tokens.
Surprisingly, the proposed method achieves state-of-the-art performance of 40.5% accuracy on the ChapterBreak AO3 suffix identification benchmark and outperforms both the strong long-context Transformers and the recent LLM GPT-3, which has 313x more parameters. The substantial improvements on these datasets demonstrate that LONGMEM can comprehend past long context in cached memory to better model future inputs.
3.3 Memory-Augmented In-Context Learning
LLMs have the emergent capability of in-context learning (ICL), learning knowledge non-parametrically from few-shot demonstration examples in the local context. However, conventional in-context learning is heavily restricted by the input context length, rendering it ineffective at absorbing supervision from sufficient demonstration examples in the training set. With the proposed unlimited-length memory augmentation, our LONGMEM method can overcome the limitation on the number of demonstration examples in the local context and even attend over the whole training set by loading it into the cached memory. In this way, LONGMEM goes beyond conventional few-shot in-context learning and realizes memory-augmented in-context learning with thousands of auxiliary demonstration examples.
Evaluation Setting. Here, we evaluate the in-context learning capability of the baselines and the proposed LONGMEM model on five Natural Language Understanding (NLU) datasets: SST-2 [SPW+13], MPQA [WWC05], MR [ABK+07], Subj [PL04], and SST-5 [SPW+13]. We evaluate models in two few-shot settings, 4-shot and 20-shot. The 4-shot setting is a data-insufficient scenario, while the 20-shot demonstrations almost fill the 1k input length and provide sufficient contextual supervision. We transform the k-shot examples into semantically meaningful demonstration examples via a fixed text template, i.e., $d_i = $ "Review: $x_i$ Sentiment: $y_i$", $\forall \{(x_i, y_i)\}_{i=1}^{k} \subset \mathcal{D}_{\text{train}}$, for sentiment analysis tasks. Additionally, we evaluate 3-shot ICL on the question-answering task SQuAD [RZLL16] in an open-ended generation setting. The details of all prompt templates are presented in Appendix C. We concatenate the demonstration examples with newlines to delimit them. The prediction label is generated directly using greedy decoding given the demonstration examples and the test case in context. Prediction accuracy is used as the evaluation metric. We report the mean and standard deviation over 6 runs with different random seeds to overcome the randomness in selecting the k-shot demonstration examples. As illustrated before, the chunk size controls the granularity of the retrieved text chunks. Since the selected NLU datasets require retrieving fine-grained labels from cached memory, we perform a hyperparameter selection on the validation set of SST-2, and the best chunk-size of 2 is used to report the results for MemTRM and our model.
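As an illustration of the prompt construction (the helper name and example sentences are ours; the template follows Table 8):

```python
def build_icl_prompt(demos, test_sentence):
    """demos: list of (sentence, label) pairs; returns the full prompt string."""
    lines = [f"Review: {x} Sentiment: {y}" for x, y in demos]
    lines.append(f"Review: {test_sentence} Sentiment:")
    return "\n".join(lines)          # demonstrations delimited by newlines


demos = [("A gripping, beautifully shot film.", "positive"),
         ("Tedious and far too long.", "negative")]
print(build_icl_prompt(demos, "An unexpected delight from start to finish."))
# The model then generates the label greedily; for LONGMEM, thousands of extra
# demonstrations are additionally loaded into the cached memory bank.
```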
Results. The in-context learning results are summarized in Table 5 and Table 4. LONGMEM achieves remarkable improvements on all NLU tasks in the 20-shot sufficient in-context setting, with a +8.0 average score increase over pretrained GPT-2* and MemTRM. Meanwhile, LONGMEM also brings performance improvements in the scenario of 4-shot demonstrations in the local context. Additionally, LONGMEM improves the in-context learning capabilities of LLMs on open-ended generation tasks, with a +4.5 EM score increase on SQuAD.
| Model    | In-Context #Demons. | In-Memory #Demons. | SST-2     | MR        | Subj      | SST-5    | MPQA      | Avg. |
|----------|---------------------|--------------------|-----------|-----------|-----------|----------|-----------|------|
| Majority | N/A                 | N/A                | 50.9      | 50.0      | 50.0      | 20.0     | 50.0      | 44.2 |
| GPT-2*   | 4                   | N/A                | 68.3±11.6 | 64.7±12.5 | 51.9±4.2  | 31.4±4.4 | 61.5±11.8 | 55.6 |
| MemTRM   | 4                   | 2000               | 67.5±12.4 | 64.6±11.3 | 53.2±6.0  | 29.6±4.4 | 63.0±12.1 | 55.6 |
| LONGMEM  | 4                   | 2000               | 71.8±14.0 | 65.1±11.0 | 53.8±3.7  | 36.0±6.8 | 65.4±12.8 | 58.4 |
| GPT-2*   | 20                  | N/A                | 68.2±11.5 | 63.4±5.2  | 57.6±10.2 | 33.6±6.0 | 70.8±7.6  | 58.7 |
| MemTRM   | 20                  | 2000               | 65.1±9.6  | 65.1±9.3  | 58.2±10.6 | 31.9±6.3 | 72.7±7.4  | 58.6 |
| LONGMEM  | 20                  | 2000               | 78.0±14.1 | 78.6±3.3  | 65.6±8.5  | 36.5±7.5 | 74.6±7.3  | 66.7 |

Table 5: Accuracy [%] of 4-shot and 20-shot ICL on 5 NLU tasks (SST-2, MR, Subj, SST-5, MPQA). We sample 2000 extra demonstration examples and load them into cached memory. The value after ± is the standard deviation across 6 runs. Avg. refers to the average accuracy over the 5 datasets.
| Model   | EM        | F1        |
|---------|-----------|-----------|
| GPT-2*  | 22.28±2.3 | 30.78±2.0 |
| MemTRM  | 22.84±3.5 | 32.65±2.8 |
| LONGMEM | 26.77±2.3 | 35.70±2.0 |

Table 4: Exact match (EM) and F1 scores of 3-shot (about 1k tokens) in-context learning on SQuAD. LONGMEM loads 200 extra demonstration examples into cached memory.
The results indicate that the demonstration examples loaded in cached memory can be regarded as auxiliary contextual demonstrations that are attended to and are helpful for in-context learning. The LONGMEM model can harvest task-relevant knowledge from both the local contextual demonstrations and the in-memory augmented demonstrations for better in-context learning.
3.4 Ablation Studies
So far, we have empirically verified the effectiveness and superiority of LONGMEM in utilizing cached memory for long-context modeling, long-context understanding, and many-shot in-context learning. As the design of the cached memory bank involves several hyperparameters, such as the memory size msz and the chunk-size csz, we perform a series of ablation studies to evaluate the effects of these hyperparameters on task performance.
Effects of Chunk-Size. As analyzed before, the chunk-size csz controls the granularity of retrieval and thus may make a difference for tasks that require fine-grained retrieval, such as in-context learning. We perform an ablation study on the effect of the chunk-size csz ∈ {2, 4, 8} on in-context learning; the results are presented in Figure 4(a). A chunk size of 2 yields the best in-context learning performance on the five NLU datasets, which is consistent with the requirement of these tasks for fine-grained retrieval and fusion of classification label tokens.
Effects of Memory Size. The memory size (msz) controls the capacity of the memory bank. In general, the memory size should be compatible with the average length of the documents or contexts; e.g., a set of books with an average length of 16k tokens should use a memory size of 16k tokens in cached memory. The training msz of 65k tokens is excessive for downstream tasks such as ChapterBreak, where the whole prefix context length does not exceed 65k tokens. Thus, we perform an ablation study on the effect of the memory size msz ∈ {8k, 16k, 32k, 65k} during the inference stage on the PG-22 language modeling datasets; the results are shown in Figure 4(b). To model books of average length 8k-50k tokens, the smaller memory size of 16k, which is consistent with the average length of the target books, yields the best perplexity.
4 Related Work
Large Language Models. Large language models, e.g., GPT-2 [RWC+19], GPT-3 [BMR+20], OPT [ZRG+22], and BLOOM [SFA+22], have significantly revolutionized NLP research and promoted the state-of-the-art of various language understanding, language generation [WZG+22], and even vision-language tasks [WDC+22]. Additionally, via scaling up the model parameters, LLMs exhibit "emergent abilities" [WTB+22] like few-shot in-context learning [BMR+20], multi-step reasoning [WWS+22], code completion, etc.
Figure 4: (a) Accuracy on the 5 NLU datasets given different chunk sizes during inference; (b) perplexity on 4 splits of PG-22 given different memory sizes during inference, where the perplexity at msz=65k is used as the baseline.
x-formers. To enable Transformers to attend over longer context, many variants of "x-formers" have been proposed. Transformer-XL [DYY+19] proposes to cache the attention keys and values of the past segment and reuse them in a recurrent manner. Recent seminal x-formers, including LinFormer [WLK+20], LongFormer [BPC20], and Routing Transformer [RSVG21], propose various sparse attention mechanisms to decrease the O(n^2) complexity to O(n log n) or even O(n). BigBird [ZGD+20] achieves a 4k sequence length by attending over a subset of context tokens. Although these x-formers achieve substantial efficiency improvements, such efficiency gains are not remarkable when modeling sequences that span book-level length. Moreover, the maximum sequence length of these methods is still upper-bounded by 16k tokens, making them inapplicable to modeling long sequences at the book or Wikipedia-page level (e.g., an average of 70k tokens for full-length books in the PG-19 dataset [RPJL19]).
Side-Tuning. Side-Tuning [ZSZ+20, SCB22] is a task-specific tuning method for pre-trained models that trains a lightweight side network fused with the fixed pre-trained network via summation. Our method inherits the idea of adopting a side network but differs from side-tuning in its learning objective and cross-network fusion. LONGMEM augments LLMs with decoupled memory for memorizing long past inputs and does not involve any task-specific tuning. The cross-network residual connections proposed in LONGMEM are novel and distinct from the vanilla summation of Side-Tuning.
5 Conclusion
In this paper, we propose to augment LLMs with long-term memory, enabling them to memorize long-form context. The designed decoupled memory module caches the attention keys and values of past inputs for future retrieval and fusion. A decoupled residual SideNet is introduced as the memory retriever and reader, while the LLM itself is frozen and works as the knowledge and memory encoder. Experiments on various long-context language modeling datasets demonstrate the effectiveness of our model over other memory-augmentation baselines. The proposed method also enables in-context learning of LLMs to overcome the limit on the number of in-context demonstration examples, which is constrained by the context length, by caching thousands of auxiliary demonstration examples in memory.
References
[ABK+07] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary
Ives. Dbpedia: A nucleus for a web of open data. In The semantic web, pages 722–735. Springer,
2007.
[BMH+21] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie
Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego
de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, T. W. Hennigan, Saffron Huang, Lorenzo
Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol
Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and L. Sifre. Improving
language models by retrieving from trillions of tokens. ArXiv, abs/2112.04426, 2021.
[BMR+20] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel
Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler,
Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray,
Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever,
and Dario Amodei. Language models are few-shot learners. ArXiv, abs/2005.14165, 2020.
[BPC20]
Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer.
arXiv preprint arXiv:2004.05150, 2020.
[DCLT19]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep
bidirectional transformers for language understanding. In NAACL, 2019.
[DYY+19] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdi-
nov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint
arXiv:1901.02860, 2019.
[GBB+20] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason
Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text
for language modeling. arXiv preprint arXiv:2101.00027, 2020.
[JDJ21]
Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. IEEE
Transactions on Big Data, 7:535–547, 2021.
[KB15]
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR,
abs/1412.6980, 2015.
[LGW+23] Rui Lv, Junliang Guo, Rui Wang, Xu Tan, Qi Liu, and Tao Qin. N-gram nearest neighbor machine
translation. arXiv preprint arXiv:2301.12866, 2023.
[LOG+19] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized bert pretraining
approach. ArXiv, abs/1907.11692, 2019.
[PL04] Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summa-
rization based on minimum cuts. arXiv preprint cs/0409058, 2004.
[PSL21]
Ofir Press, Noah A Smith, and Mike Lewis. Train short, test long: Attention with linear biases
enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021.
[RNSS18]
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language
understanding with unsupervised learning. 2018.
[RPJL19]
Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. Compressive
transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.
[RSR+20] Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified
text-to-text transformer. ArXiv, abs/1910.10683, 2020.
[RSVG21]
Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based
sparse attention with routing transformers. Transactions of the Association for Computational
Linguistics, 9:53–68, 2021.
[RWC+19] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
models are unsupervised multitask learners. 2019.
[RZLL16]
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ Questions
for Machine Comprehension of Text. arXiv e-prints, page arXiv:1606.05250, 2016.
[SCB22]
Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Lst: Ladder side-tuning for parameter and memory
efficient transfer learning. arXiv preprint arXiv:2206.06522, 2022.
[SFA+22] Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman
Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-
parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
[SPW+13] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng,
and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment
treebank. In Proceedings of the 2013 conference on empirical methods in natural language
processing, pages 1631–1642, 2013.
[STI22]
Simeng Sun, Katherine Thai, and Mohit Iyyer. Chapterbreak: A challenge dataset for long-range
language models. arXiv preprint arXiv:2204.10878, 2022.
[VSP+17] Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
[WDC+22] Weizhi Wang, Li Dong, Hao Cheng, Haoyu Song, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and
Furu Wei. Visually-augmented language modeling. arXiv preprint arXiv:2205.10178, 2022.
[WLK+20] Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention
with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
[WRHS22]
Yuhuai Wu, Markus N. Rabe, DeLesley S. Hutchins, and Christian Szegedy. Memorizing trans-
formers. ArXiv, abs/2203.08913, 2022.
[WTB+22] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani
Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large
language models. arXiv preprint arXiv:2206.07682, 2022.
[WWC05]
Janyce Wiebe, Theresa Wilson, and Claire Cardie. Annotating expressions of opinions and emotions
in language. Language resources and evaluation, 39(2):165–210, 2005.
[WWS+22] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny
Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint
arXiv:2201.11903, 2022.
[WZG+22] Weizhi Wang, Zhirui Zhang, Junliang Guo, Yinpei Dai, Boxing Chen, and Weihua Luo. Task-
oriented dialogue system as natural language generation. In Proceedings of the 45th International
ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2698–2703,
2022.
[YDY+19] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V.
Le. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS, 2019.
[ZGD+20] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago
Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for
longer sequences. Advances in neural information processing systems, 33:17283–17297, 2020.
[ZRG+22] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christo-
pher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer
language models. arXiv preprint arXiv:2205.01068, 2022.
[ZSZ+20] Jeffrey O Zhang, Alexander Sax, Amir Zamir, Leonidas Guibas, and Jitendra Malik. Side-tuning:
a baseline for network adaptation via additive side networks. In Computer Vision–ECCV 2020:
16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages
698–714. Springer, 2020.
A Inference Efficiency and GPU-Memory Efficiency
When the model is required to comprehend long sequences, the proposed LONGMEM method can load the out-of-boundary inputs into the cached memory as previous context. Thus, memory usage and inference speed are significantly improved compared with vanilla self-attention-based models. Detailed efficiency statistics are presented in Table 6.
| Model   | In-Context Len. | In-Memory Len. | Inference Speed (tokens/s) | GPU-Memory Usage (MB) |
|---------|-----------------|----------------|----------------------------|-----------------------|
| GPT-2*  | 4k              | N/A            | 14666                      | 20671                 |
| LONGMEM | 1k              | 3k             | 22638                      | 13335                 |
| GPT-2*  | 8k              | N/A            | 8417                       | 54195                 |
| LONGMEM | 1k              | 7k             | 21343                      | 13437                 |

Table 6: The superiority of our method over fully dense self-attention (GPT-2*) in terms of inference speed and GPU-memory utilization.
B Training Details
The pre-training of the reproduced GPT-2* iterates over 117B tokens in total, with a batch size of 512 and a fixed segment length of 1024 tokens. The Adam optimizer [KB15] is adopted for memory-augmented adaptation training. Both pre-training and adaptation are run on 16 32GB Tesla V100 GPUs. Other detailed training hyperparameters and settings are presented in Table 7.
| Hyperparameter | LONGMEM |
|----------------|---------|
| Reproduced GPT-2* Backbone LLM Hyperparameters | |
| Parameters | 407M |
| Precision | float16 |
| Layers | 24 |
| Hidden dim. | 1024 |
| Attention heads | 16 |
| Head dim. | 64 |
| Vocab size | 52k |
| Sequence length | 1024 |
| Position emb. | Alibi |
| Tied embedding | False |
| SideNet Hyperparameters | |
| Parameters | 151M |
| Precision | float16 |
| Layers | 12 |
| Hidden dim. | 1024 |
| Attention heads | 16 |
| Head dim. | 64 |
| Sequence length | 1024 |
| Memory-Augmented Adaptation Hyperparameters | |
| Global batch size | 256 |
| Learning rate | 2.0e-4 |
| Total tokens | 26B |
| Warmup tokens | 0 |
| LR decay style | polynomial |
| Adam (β1, β2) | (0.9, 0.98) |
| Adam eps | 1e-06 |
| Weight decay | 0.01 |
| Gradient clipping | 2.0 |

Table 7: Memory-augmented adaptation and architectural hyperparameters.
C Prompting Templates
We present all hand-crafted in-context learning prompting templates and labels for the 5 NLU datasets and the SQuAD QA dataset in Table 8.
| Task  | Prompt                                                     | Labels                             |
|-------|------------------------------------------------------------|------------------------------------|
| SST-2 | Review: [Sentence] Sentiment: [Label]                      | {positive, negative}               |
| MR    | Review: [Sentence] Sentiment: [Label]                      | {positive, negative}               |
| MPQA  | Review: [Sentence] Sentiment: [Label]                      | {positive, negative}               |
| SST-5 | input: [Sentence] type: [Label]                            | {terrible, bad, okay, good, great} |
| Subj  | input: [Sentence] type: [Label]                            | {objective, subjective}            |
| SQuAD | Passage: [Passage]\n Question: [Question] Answer: [Answer] |                                    |

Table 8: The hand-crafted prompts used to query model predictions on the zero-shot evaluation of the 5 NLU datasets and the question-answering dataset SQuAD.