A Guide for Large Language Model Make-or-Buy Strategies: Business and Technical Insights

August 2023
Contents

Executive Summary
1. Introduction
2. To Make or To Buy: Leveraging Large Language Models in Business
2.1. Getting Prepared for Large Language Model Make-or-Buy Decisions
2.1.1. Understanding the Large Language Model Tech Stack
2.1.2. Understanding Key Factors in Large Language Model Make-or-Buy Decisions
2.1.3. Understanding (Dis-)advantages of Open- vs. Closed-source Large Language Models
2.1.4. Understanding (Dis-)advantages of Fine-tuning vs. Pre-training Models from Scratch
2.2. Approaches for Large Language Model Make-or-Buy Decisions
3. Critical Techniques and Trends in the Field of Large Language Models: From Landscape to Domain-specific Applications
3.1. Navigating the Landscape of Large Language Models in the Generative AI Era
3.1.1. Key Techniques, Architectures, and Types of Data
3.1.2. Major Closed-source Models and Open-source Alternatives
3.1.3. Flourishing Large Language Model Applications, Extensions, and Relevant Frameworks
3.2. Domain-Specific Application of Large Language Models in Industrial Scenarios
3.2.1. Fine-tuning and Adaptation from a Technical Perspective: To What Extent Are They Needed and How Could They Help?
3.2.2. Towards Domain-specific Dynamic Benchmarking Approaches
References
Authors
Contributors
About appliedAI Initiative GmbH
Acknowledgement
Executive Summary

Key Business Highlights: Rational Approaches to Large Language Model Make-or-Buy Decisions

Firms that employ large language models (LLMs) can create significant value and achieve sustainable competitive advantage. However, the decision of whether to make or buy LLMs is a complex one and should be informed by consideration of strategic value, customization, intellectual property, security, costs, talent, legal expertise, data, and trustworthiness. It is also necessary to thoroughly evaluate available open-source and closed-source LLM options, and to understand the advantages and disadvantages of fine-tuning existing models versus pre-training models from scratch.
Depending on the strategic value and the degree of customization needed, firms have six possible approaches to consider when making LLM make-or-buy decisions:

1) Buy end-to-end application without LLM controllability

2) Buy an application with a partially controllable LLM – Procure the application including the LLM as a component, with some transparency and control

3) Make application, buy controllable LLM – Internal development of the application on top of procured LLMs controllable via APIs

4) Make application, fine-tune LLM – Internal development of the application and fine-tuning of the LLM based on procured or open-source pre-trained LLMs

5) Make application, pre-train LLM – Internal development of the application and pre-training of the LLM from scratch

6) Stop
Key Technical Highlights: Future-shaping Trends for Informed Make-or-Buy Decisions

Beyond fundamental LLM techniques such as the transformer model architecture, pre-training, and instruction tuning, there are important emerging trends that will further enhance LLM performance and adaptability in widespread domain-specific tasks. These include the development of more efficient model architectures and dataset designs, integration of memory mechanisms inspired by cognitive science, incorporation of multimodality, enhancements in factuality, and improved reasoning capabilities for autonomous task completion.

New possibilities to strike balances between open- and closed-source models, and between large and small language models, present promising opportunities. A growing open-source ecosystem is helping organizations to optimize costs and achieve the best outcomes by leveraging the strengths of each type of model. Likewise, smaller language models have demonstrated efficacy in specific tasks, challenging the notion that bigger models are always superior. Embracing this diverse range of models can promote more efficient and effective language model implementation.

Gaining a comprehensive understanding of these trends is vital for firms wanting to make well-informed decisions and avoid misconceptions about LLMs when planning long-term budgets and infrastructure design.
1. Introduction
At the start of this decade, the concept
of generative AI was known only to a few
enthusiasts and visionaries. Yet in just a few
years, it has become increasingly evident that
generative AI, and particularly techniques
related to Large Language Models (LLMs),
are to be a game-changer for individuals,
businesses, and wider society.
Generative AI and the latest class of
generative AI systems, driven by LLMs such
as GPT-4, PaLM-2, and Llama 2, are capable
of creating original content by learning from
vast datasets. These ‘foundation models’
generalize knowledge from massive amounts
of data and can be customized for a wide
range of use cases. Some use cases require
minimal ne-tuning and a lower volume of
data, while others can be solved by providing
just a task instruction with no examples
(termed zero-shot learning) or a small
number of examples (few-shot learning).
These opportunities are empowering
developers to build AI applications that were
previously impossible and which have the
potential to transform industries.
The signicance of generative AI and
LLMs cannot be overstated. By enabling
the automation of many tasks that could
previously only be performed by humans,
generative AI will signicantly increase
efciency and productivity across entire
value chains and corporate functions,
reducing costs and opening up new and
exciting opportunities for growth. A study
by McKinsey, for example, estimates that
generative AI could add between $2.6
trillion and $4.4 trillion of value to the global
economy annually and automate work
activities that currently account for 60-70% of
employees’ time
1
. Firms that do not embrace
AI are at risk of falling behind.
1 McKinsey and Company (2023). The economic potential of generative AI: The next productivity frontier. https://
www.mckinsey.com/capabilities/mckinsey-digital/our-insights/The-economic-potential-of-generative-AI-The-
next-productivity-frontier#business-and-society
With the disruptive and extremely fast-paced acceleration of AI advancement, executives are confronted with some pressing questions: What value do generative AI, and in particular LLMs, have for my business? How can I utilize the benefits of LLMs? What are the risks of embedding LLMs into my organization? And what are LLMs, anyway? Indeed, it is becoming vital to understand how to effectively leverage this technology in products, services, corporate functions and processes, and how to apply LLMs to use cases where significant added value can be achieved.

This white paper seeks to guide readers on how to navigate this new era of LLMs, enabling firms to make rational, informed decisions and achieve sustainable competitive advantage. It is essential to understand both the business and technical aspects of incorporating LLMs into your organization. As such, we address both aspects by first discussing make-or-buy decisions around the application of LLMs from a business perspective, followed by an overview of critical technical topics, including the latest trends in the field and domain-specific industrial applications of LLMs. Whatever your company's stage of AI maturity, now is the time to leverage LLMs and drive innovation further.
Glossary

Generative AI
A field of artificial intelligence that focuses on creating models capable of generating novel content, such as text, code, images, or music, that resembles human-created content.

Foundation Model
A large neural network model that captures and generalizes knowledge from massive data. A starting point for further customization and a fundamental building block for specific downstream tasks.

Large Language Model (LLM)
A powerful neural network algorithm designed to understand and generate human-like language, typically trained on a vast amount of text data and considered a type of foundation model. [See later Info Box ‘Large language models as foundation models’.]

Transformers
A type of neural network architecture that has revolutionized natural language processing tasks by efficiently capturing long-range dependencies in sequential data such as sentences or paragraphs, making it a suitable building block for large language models.

Pre-training
The initial phase of training a neural network model. The model learns from a large dataset, allowing it to capture general knowledge and patterns.

Fine-tuning
The process of adapting a pre-trained neural network model to perform specific tasks by training it on task-specific data. This allows the model to specialize its knowledge and improve its performance on specific applications.

Few-shot learning
A technique whereby an AI model learns to perform a new task with a small number of examples, making it possible to teach the model something new without needing much training data.

Zero-shot learning
A technique whereby an AI model can understand and perform a task with no specific examples or training on that task, relying instead on general knowledge it has learned from related tasks.
Q: How do you view the impact of the recent trend of generative AI?

A: “Strategically, this has changed the way we work and what our focus areas are. The output quality and ease of use will shape both our professional and our private lives.”

- Dr. Andreas Liebl, Managing Director and Founder, appliedAI Initiative GmbH
2. To Make or To Buy: Leveraging Large Language Models in Business

Effectively utilizing LLMs in business requires consideration of several factors that will affect decisions to either leverage external closed-source models via APIs, develop LLMs in-house, or take some form of intermediary approach. There is no clear-cut answer to these decisions, but a systematic approach requires looking beyond the application alone and taking the underlying LLMs into account when informing make-or-buy choices.
2.1. Getting Prepared for Large Language Model Make-or-Buy Decisions

2.1.1. Understanding the Large Language Model Tech Stack
To achieve this, the rst step is to assess
which capabilities and internal resources
are available and, in turn, which tech stack
should be addressed. The LLM tech stack is
generally understood to consist of four layers
as presented in Figure 1.
The bottom layer is the infrastructure
required (such as necessary hardware or
cloud platforms). This includes the systems
and processes needed to develop, train,
and run LLMs, such as high-performance
computation (HPC) optimized for AI and
Deep Learning. Anticipated use cases
and their scalability inuence the overall
infrastructure decision.
The second layer is the data volume and
quality required. The amount of data needed
strongly depends on approaches to use and
customization of LLMs (e.g., pre-training vs.
ne-tuning), so data quality and data curation
are always crucial for LLM success. Firms can
invest in data curation and preprocessing
techniques such as data cleaning,
normalization, and augmentation, to enhance
data quality and consistency. Implementing
rigorous quality control measures during the
data collection and labeling process can also
improve data reliability.
Figure 1. The tech stack for large language models: (1) Infrastructure, (2) Data, (3) LLM, (4) LLM Application.
On the third layer is the LLM itself, which will eventually form the basis for idiosyncratic applications. LLMs can be open- or closed-source (cf. Chapter 2.1.3. and Chapter 3.1.2.). Firms should aim to create synergies between value-adding use cases as part of a systematic make-or-buy strategy.

The fourth and top layer is LLM applications. These applications can either build upon end-to-end applications or rely on an external third-party API. The make-or-buy decision for the application layer depends on the specifics of the lower layers. For example, if a firm lacks high-quality data, then “make” is unlikely to be a feasible option here.

2.1.2. Understanding Key Factors in Large Language Model Make-or-Buy Decisions
Besides the LLM tech stack, there are other factors that should be considered in make-or-buy decisions for LLMs, including the following:
1) Strategic value. Ensuring that the deployment of LLMs is in line with the overall corporate strategy is of utmost importance in make-or-buy decisions. The main reason for developing an LLM in-house is that it can provide high strategic value with high scalability and value creation, enabling a firm to achieve sustainable competitive advantage. By building LLMs internally, organizations can establish and maintain proprietary knowledge and in-house expertise, creating an intellectual asset. This intellectual property can contribute to long-term competitive advantage as it becomes increasingly difficult for competitors to replicate or imitate. Competitive advantage can also be achieved through LLM fine-tuning, depending on the quality and value of the training data. As fine-tuning approaches are relatively inexpensive, this presents a promising value-creation opportunity for firms with data assets. In contrast, when LLMs are developed and trained externally, they are available to a wider market and thus to competitors, meaning no sustainable competitive advantage can be achieved. Moreover, having in-house LLM development capabilities fosters innovation and a culture of continuous learning in that it enables firms to stay at the forefront of technological advancements.
2) Customization. Developing LLMs in-house typically allows for greater customization, meaning that LLMs can be tailored to requirements and firm-specific use cases. This point mostly holds for fine-tuning models with unique internal data. In comparison to off-the-shelf products, customized LLMs allow for greater flexibility while also maintaining full ownership (cf. Chapter 3.2 “Domain-Specific Application of Large Language Models in Industrial Scenarios” for more technical information). While using external non-customized LLMs will mean lower costs, it is important to note that potentially sensitive data must be shared with the external partner.
3) Intellectual property (IP). LLMs, especially those sourced from the external market, are trained on extensive datasets that may include copyrighted materials or proprietary information. As a result, there may be concerns regarding ownership and usage rights of generated content. Firms must therefore establish clear policies and agreements that address IP rights concerning LLM-generated
content. These policies should outline ownership of content, licensing or usage restrictions, and provisions for protecting sensitive information. Collaborative efforts involving third parties should ensure that these issues are considered during contracting. It should be noted, however, that there is still a great deal of uncertainty around IP rights stemming from content created through generative AI.
4) Security. LLMs can require the processing of extremely sensitive business information. Firms should conduct a thorough risk assessment for each use case to identify and address potential security issues ex ante. For highly sensitive data, it is typically recommended to host the LLM within the firm’s own insulated network. If this is not possible, it is crucial to collaborate with reputable external LLM providers who adhere to stringent security standards and are transparent about their security practices. For data falling under the GDPR, firms must ensure that all data is stored and processed on servers within Europe.
5) Costs. Developing LLMs in-house is a costly endeavor. It first requires significant investment in hiring a highly skilled workforce, including ML engineers and NLP specialists, who tend to command high salaries. The development process itself is then time-consuming and resource-intensive, involving extensive research, data collection, model training, and iterative improvement cycles, all of which demand considerable computing power and infrastructure investment. Ongoing maintenance, updates, licenses, and support require continuous investment to ensure optimal performance and reliability. Last, it is important to consider the opportunity costs of allocating internal resources to LLM development over core business activities. While in-house development offers several benefits, it diverts attention and resources from other strategic initiatives and potentially delays time-to-market, which can lead to increased opportunity costs. Executives should therefore carefully evaluate financial implications and weigh costs against potential benefits before deciding to develop LLMs in-house. Fine-tuning may be a more suitable approach in many cases, with substantially lower costs. To address high development costs, organizations can explore ways to streamline the labeling and development cycles. Leveraging pre-existing labeled datasets or partnering with external data providers can reduce the need for extensive manual labeling, saving time and resources. Additionally, adopting cloud-based solutions for data storage and processing can offer scalability and cost-efficiency, enabling organizations to handle large volumes of data more effectively.
6) Talent. The scarcity of experienced professionals in fields such as data science, ML, and NLP often makes it difficult to establish a skilled in-house team, especially for SMEs confronted with resource constraints. In Europe, the competition for top talent is fierce, with SMEs and large firms alike facing recruitment difficulties and talent shortages. Additionally, extremely rapid development in the field of LLMs necessitates continuous learning and professional development, meaning companies should make significant investments in training and upskilling their workforce. Overcoming these hurdles requires a strategic approach that can include fostering partnerships with academic institutions, collaborating with external partners, offering competitive salaries, and creating a stimulating work environment that promotes innovation. Firms already confronted with talent scarcity may decide to source their LLM solutions from the market to save direct and indirect talent-related costs and to utilize their talent resources for other projects. Fine-tuning models in-house often constitutes a middle course that can strike a balance between acquiring off-the-shelf products and developing models from scratch.
7) Legal expertise. Developing LLMs in-house requires firms to seek legal expertise to navigate an increasingly complex regulatory landscape. For instance, the proposed EU AI Act, which focuses on preventing harm to health, safety, and fundamental human rights, would involve a risk-based approach whereby AI systems would be assigned to a risk class. High-risk systems such as LLMs would need to meet stricter requirements than low-risk systems. Firms pursuing in-house development of LLMs must ensure they follow all regulatory requirements and thus obtain increasingly complex legal expertise. If this is not available in-house, or if firms want to reduce their general liability, they may instead decide to buy an LLM from the market and ensure the provider is fully liable, i.e., that the specific use case is in line with applicable laws and regulations. Additionally, by considering risk classification early in the decision-making process and making timely decisions, firms can avoid unnecessary expenditures and undesired legal consequences.
8) Data. Data is of utmost importance for LLM performance. LLMs rely on vast amounts of diverse data to understand language patterns, enhance accuracy, and generate coherent and appropriate responses. However, biases inherent to the data can pose challenges. For example, LLMs might inadvertently learn and perpetuate biases present in training data. Efforts are being made to identify and mitigate such biases. Diverse and inclusive training data is crucial to ensure fairness and reduce perpetuation or amplification of existing biases, and regular monitoring and user feedback are vital for detecting and rectifying biases. By evaluating LLM outputs and actively seeking user input, developers can improve systems’ fairness and mitigate biases. Data is equally important for the process of fine-tuning LLMs. By fine-tuning with domain-specific data, LLMs can acquire specialized knowledge and language patterns related to the target task, enabling them to generate responses that align with the specific requirements of the use case. Moreover, fine-tuning also helps address biases and improve fairness in LLM responses. By fine-tuning with datasets that are explicitly designed to be diverse, inclusive, and representative, developers can reduce biases and ensure that the LLM performs more equitably.
Q: What, in your opinion, is the most critical challenge or risk that the European industry needs to address when adopting LLMs for practical use cases?

A: “Among the most critical challenges for the industry when adopting LLMs is the alignment with existing and upcoming regulations, such as the EU AI Act. At the same time, this challenge is also an opportunity to honor our customers' trust in their data with our own standards and approach, and to get them on board with the change. This alignment includes meeting data management requirements, model evaluation, testing, monitoring, disclosure of computational and energy requirements, and downstream documentation. In terms of data privacy, companies from Europe need to be cautious about sharing sensitive data with LLMs hosted by foreign entities and comply with GDPR regulations. To address this challenge, potential mitigation measures include developing robust data anonymization techniques, implementing secure and private computing methods, encouraging local LLM development to reduce reliance on foreign models, and working with regulators to establish clear guidelines and frameworks for the responsible use of AI.”

- Dr. Stephan Meyer, Head of Artificial Intelligence, Munich Re Group
9) Trustworthiness. Trustworthiness is of paramount importance when employing LLMs. In-house development of LLMs allows firms to have full control over the entire process, enabling them to build LLMs in line with their values and ethical considerations. This control fosters trustworthiness by ensuring that LLMs are aligned with firms’ mission and vision. Moreover, in-house development enables transparency and explainability. Firms can document and communicate development methodologies, data sources, and training processes, allowing users to better understand and evaluate LLM outputs. By mitigating biases and ensuring fairness, firms can build trust among users, assuring them that the LLMs provide accurate and unbiased information. Alternatively, when buying LLMs from the market, especially from established suppliers, firms may benefit from the fact that the acquired LLM has undergone rigorous testing, evaluation, and compliance checks to ensure it meets industry standards and regulatory requirements. Again, the fine-tuning of models often constitutes a compromise between trustworthiness and effort.

Together, these factors should be viewed holistically and acted on as such, rather than being addressed in isolation.

2.1.3. Understanding (Dis-)advantages of Open- vs. Closed-source Large Language Models
Make-or-buy decisions regarding LLMs require thorough evaluation of available options, which include open-source and closed-source LLMs. Generally, the current market environment is dominated by closed-source, API-based LLMs, yet there is an ever-growing number of open-source options. Figure 2 below provides an overview of notable open- and closed-source LLMs released between 2019 and June 2023 [1].

As Figure 2 shows, there is a wide range of options for open-source and closed-source LLMs.¹ Available open-source options tend to allow for greater transparency and auditability than their proprietary counterparts. With open-source models, researchers and developers can access the underlying code, model architecture, and training data, such that they can understand the inner workings of the model and identify potential biases or ethical concerns. Indeed, whereas transparency is a crucial aspect of open-source LLMs, closed-source LLMs are most often a black box with opaque underlying functioning. When a model's code and data are made openly available, developers can scrutinize and verify its behavior, ensuring it aligns with desired ethical standards. This transparency can also help to address concerns about algorithmic biases and discriminatory outputs. Researchers and the wider community can work together to identify and rectify these issues, leading to fairer, more trustworthy language models.

¹ See also Chapter 3.1. for a more comprehensive analysis as well as detailed lists of available options from a technical perspective, in particular for the trend of maximizing the benefits by incorporating both large closed-source LLMs and a combination of large and small, specialized open-source LLMs.
Several prominent open-source LLM initiatives have emerged, each making significant contributions to the field. As well as early versions of OpenAI's GPT (Generative Pre-trained Transformer), an influential open-source LLM initiative is Hugging Face's Transformers library, which provides a comprehensive set of pre-trained models including various architectures such as GPT, BERT, and RoBERTa. The library also offers tools and utilities for training, fine-tuning, and deploying models, making it easier for developers to leverage the power of LLMs in their applications. The Transformers library has gained widespread popularity due to its user-friendly interface, extensive documentation, and support from a vibrant community. Several other open-source LLM projects and libraries exist, such as Fairseq, Tensor2Tensor, and AllenNLP.
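To illustrate how low the barrier to entry is, the following minimal Python sketch loads an open pre-trained model through the Transformers library and generates text. The model name "gpt2" is an illustrative choice only; any compatible open checkpoint can be substituted.

    # Minimal sketch: text generation with an open model via Hugging Face Transformers.
    from transformers import pipeline

    # "gpt2" is illustrative; swap in any open text-generation checkpoint.
    generator = pipeline("text-generation", model="gpt2")
    result = generator("Large language models can", max_new_tokens=20)
    print(result[0]["generated_text"])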
Figure 2. Open-source and closed-source large language models with over 10 billion parameters
released between 2019 and June 2023 [1]
In turn, closed-source LLMs often leverage significant computational resources and proprietary datasets during their training, allowing them to perform at extremely high levels on a range of language tasks. The investment in infrastructure and data acquisition made by companies can result in LLMs that surpass the capabilities of open-source models. However, firms are especially concerned about data protection and information security when closed-source LLMs are run as software as a service (API-based models), an approach increasingly used by vendors. Customization of closed-source models means that firms need to transfer their often highly sensitive data to the vendor for fine-tuning.
2.1.4. Understanding (Dis-)advantages of Fine-tuning vs. Pre-training Models from Scratch

Another critical aspect in make-or-buy decisions regarding LLMs relates to an in-depth understanding of the advantages and disadvantages of fine-tuning existing models versus pre-training models from scratch, specifically considered from a business perspective.

Fine-tuning pre-trained LLMs generally incurs significantly lower costs compared to building them from scratch. Depending on the underlying data structure and volume, fine-tuning costs can be relatively low, ranging from a few hundred to a few thousand US dollars. In fine-tuning, a pre-trained model is already available, eliminating the need for resource-intensive pre-training on vast amounts of data with large amounts of computational power. This translates to significant savings in resources, time, and electricity consumption.

Conversely, pre-training LLMs from scratch involves substantial costs at various stages of the process, which combined can reach millions of dollars. For example, the training costs for OpenAI's GPT-3 are estimated to be $5 million, while models with more
training parameters are estimated to
exceed these costs. Pre-training LLMs from
scratch demands an enormous amount of
computational power, specialized hardware,
and extensive infrastructure, all of which add
heavy costs. Another consideration is that the
pre-training process can take weeks or even
months to complete, adding to the costs of
computational resources and electricity.
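To see where such multi-million-dollar estimates come from, the following back-of-envelope Python sketch applies a common rule of thumb from the scaling-law literature (training compute of roughly 6 x parameters x tokens). All concrete figures here (token count, GPU throughput, hourly price) are assumptions for illustration, not vendor quotes.

    # Back-of-envelope pre-training compute estimate (illustrative assumptions only).
    # Rule of thumb from the scaling-law literature: training FLOPs ~= 6 * N * D,
    # where N = parameter count and D = number of training tokens.
    n_params = 175e9          # GPT-3-scale model (parameters)
    n_tokens = 300e9          # assumed training tokens
    flops = 6 * n_params * n_tokens              # ~3.15e23 FLOPs

    gpu_flops = 100e12        # assumed sustained throughput per GPU (100 TFLOP/s)
    gpu_hours = flops / gpu_flops / 3600         # ~875,000 GPU-hours

    price_per_gpu_hour = 2.0  # assumed cloud price in USD
    print(f"~{gpu_hours:,.0f} GPU-hours, ~${gpu_hours * price_per_gpu_hour:,.0f}")

Even with these rough assumptions, the result lands in the millions of dollars once real-world inefficiencies, restarts, and infrastructure overheads are added, consistent with the estimates cited above.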
There are also notable differences in data acquisition and annotation costs. Fine-tuning LLMs typically requires a smaller labeled dataset for the target task, which can be less expensive to obtain, annotate, and curate than the comprehensive and diverse datasets required for pre-training an LLM from scratch. The costs of acquiring and labeling a large-scale dataset can be substantial, and manipulation of such assets requires substantial domain expertise and significant human effort.
Overall, then, there are usually cost advantages to fine-tuning LLMs compared to pre-training them from scratch. However, it is essential to consider the specific requirements of each use case, including the scale of the target task, availability of data, and potential risks, to determine the most appropriate approach based on available resources and objectives. Ultimately, decisions on this question will depend on the business cases and financial resources a firm is willing to invest. See also Chapter 3.2. “Domain-Specific Application of Large Language Models in Industrial Scenarios” for relevant discussions from a technical perspective.
2.2. Approaches for Large Language Model Make-or-Buy Decisions

After acknowledging the LLM tech stack and the relevant key factors and business considerations, there are six generic approaches that firms can follow when making LLM make-or-buy decisions:
1) Buy end-to-end application without LLM controllability

When evaluating use cases of low strategic value and limited customization requirements for both the application and the LLM, acquiring a pre-built end-to-end application is typically the most convenient solution, with the LLM operating merely as a hidden component. Given that the LLM is highly tailored to the application and its scope, explicit customization and controllability are unnecessary and likely not allowed by the vendor.
2) Buy an application with a partially controllable LLM – Procure the application including the LLM as a component, with some transparency and control

This approach of procuring an application along with controllable LLMs applies to use cases that demand minimal adjustments or can be deployed immediately. It is worth noting that in scenarios where customization needs are low, it may be less necessary to control the underlying LLM, and companies might instead focus on adapting only the user layer to meet their requirements. Nevertheless, case-specific requirements concerning the degree of customization, regulation, data security/secrecy, intellectual property (IP) concerns, and overall performance should be carefully considered. Another point of attention is the reusability of an LLM across applications in the company and how this might produce undesired dependencies and vendor lock-in scenarios. This approach is only feasible in cases of low data confidentiality that allow transfer to external providers.
3) Make application, buy controllable LLM – Internal development of the application on top of procured LLMs via APIs, e.g., Azure OpenAI Services (see the sketch after this list)

An alternative to the above approach is to focus exclusively on the internal development of the application while integrating externally sourced pre-trained or fine-tuned LLMs. This approach is particularly suitable for use cases that demand medium to high levels of LLM customization and is especially relevant when internal resources such as computing power, capacity, or skills are not sufficiently available. Additionally, budget constraints can also drive the decision to adopt this strategy. However, as with approach 2, considerations regarding customization, regulation, data security/secrecy, and IP, as well as overall performance and model reusability, need to be carefully taken into account, and vendors should be carefully scrutinized.
4) Make application, fine-tune LLM – Internal development of the application and fine-tuning of the LLM based on procured or open-source pre-trained LLMs

This approach involves utilizing existing pre-trained LLMs, along with specific fine-tuning frameworks or services, and combining them with internal development efforts to build applications and fine-tune models using internal data for targeted use cases. The quantity and quality of open-source pre-trained LLMs are continuously rising, but the licenses of these pre-trained models can impose significant limitations on their commercial use. For fine-tuning, several providers such as AWS, Google, NVIDIA, H2O, and others already offer such services, and various open-source fine-tuning services are also available. The level of internal development required depends on the sophistication of the fine-tuning components and the quality of the underlying pre-trained LLM, as well as the availability of in-house data. While fine-tuning models is comparatively inexpensive, data quality is often a major bottleneck. Nevertheless, this approach offers a viable option for achieving sufficient customization and quality of LLMs while maintaining control over internal data processing and LLM hosting. This can become particularly important in certain use cases, ensuring sustainable competitive advantage.
5) Make application, pre-train LLM – Internal development of the application and pre-training of the LLM from scratch

This approach involves full end-to-end development (“make”), building the application itself as well as pre-training LLMs in-house from scratch. The broader the applicability of an LLM and the greater the value it can generate, the better it is to pursue the “make” approach. This option is also advisable in highly sensitive use cases where relying on externally sourced models is not an option. Although very costly, developing LLMs from scratch might be the best option for achieving optimal customization and quality, and for ensuring a sustainable competitive advantage.

6) Stop

If the use case holds limited strategic value, it is advisable to assign resources to use cases of higher strategic significance.

Figure 3 provides a guide to which approach to use, organized by the strategic value of an application and the degree of customization needed.
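As an illustration of approach 3, the following minimal Python sketch calls an OpenAI-style chat completions endpoint over plain HTTP. The endpoint URL, model name, and messages are assumptions for illustration; Azure OpenAI Services, for example, uses a different URL scheme and an "api-key" header instead of a bearer token.

    # Illustrative only: building an application on top of a procured LLM via API.
    import os
    import requests

    response = requests.post(
        "https://api.openai.com/v1/chat/completions",  # vendor endpoints differ
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-4",  # illustrative model name
            "messages": [
                {"role": "system", "content": "You answer questions about our product docs."},
                {"role": "user", "content": "Summarize the warranty terms."},
            ],
        },
        timeout=60,
    )
    print(response.json()["choices"][0]["message"]["content"])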
Q: What are your thoughts on the potential impact of large language models in the semiconductor industry, and how do you see that affecting your company?

A: “In the semiconductor industry there are two main value potentials: improving our processes and creating customer value. One area where this potential can be realised is in knowledge retrieval throughout research and development and manufacturing processes, leading to enhanced speed and stability, for example in the case of equipment maintenance. This reduces our dependency on specific experts with the right domain knowledge being present 24/7 to solve critical issues and helps us train new experts faster. Moreover, by providing top-notch customer support for our highly technical products, we can deliver a better customer experience while increasing the scalability associated with such service. Additionally, there is significant room for improving productivity in support functions, ranging from generating product documentation to marketing and beyond – lots of potential.”

- Simon-Pierre Genot, Senior Manager AI Strategy, Infineon Technologies
3. Critical Techniques and Trends in the Field of Large Language Models: From Landscape to Domain-specific Applications

3.1. Navigating the Landscape of Large Language Models in the Generative AI Era

3.1.1. Key Techniques, Architectures, and Types of Data
LLMs are an integral part of the generative AI era. They are complex systems that can process natural language input and generate human-like responses. Navigating the landscape of LLMs in this era requires an understanding of key techniques as well as the types of data used in these models. In this section, we discuss some of the fundamental aspects of LLMs that enable them to function seamlessly, before describing some key trends observed in this fast-developing field.
The Fundamentals
Transformer as the Base Architecture that
Handles Contextual Meanings
One of the most popular techniques
used in LLMs is the transformer model
architecture, introduced by Vaswani et al. in
2017 [2]. Transformers are neural networks
that can process sequences of data such
as text while being able to handle long-
range dependencies and understand
context. They do this by implementing an
‘attention’ mechanism that allows the model
to process an entire input sequence all at
once and capture the relative importance
of each input token to every other token
in the context. This enables the LLM to
understand the complicated relationships
between words, phrases, etc., even when
they are far apart in the input sequence.
Furthermore, the transformer architecture
offers a key advantage over previous
recurrent neural network models in that it is
highly parallelizable, facilitating large-scale
training on distributed hardware. The basic
transformer architecture has been used or
adapted in some of the most powerful and
popular LLMs, such as GPT-3, T5, and BERT.
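The core of this mechanism can be shown in a few lines. The following is a minimal NumPy sketch of scaled dot-product self-attention; real transformer layers add multiple heads, masking, per-layer learned projections, residual connections, and layer normalization, so treat this as illustration only.

    # Minimal sketch of scaled dot-product self-attention (NumPy).
    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        """X: (seq_len, d_model); Wq/Wk/Wv: learned projection matrices."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)      # relevance of every token to every other
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
        return weights @ V                   # context-aware token representations

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 16))             # 5 tokens, 16-dim embeddings
    Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)   # (5, 16)

Because the whole sequence is processed in one matrix product rather than token by token, this computation parallelizes well, which is the scalability advantage noted above.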
Pre-training as a Key Procedure to Equip the Model with Fundamental Knowledge

Another key technique used in LLMs is pre-training, which involves training a model on a large corpus of text data before fine-tuning it on a specific task. This technique has been shown to improve the performance of LLMs on a variety of downstream tasks such as translating languages, answering questions, and generating text. Pre-training can be conducted using a variety of objectives, including language modeling, where the model is trained to predict the next word in a sequence, and masked language modeling, where some of the input tokens are masked and the model must predict their original values.
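The next-word-prediction objective reduces to a standard classification loss over the vocabulary. A minimal PyTorch sketch, assuming random stand-ins for model outputs and token ids:

    # Illustrative next-word-prediction (language modeling) loss in PyTorch.
    import torch
    import torch.nn.functional as F

    vocab_size, seq_len = 1000, 8
    logits = torch.randn(seq_len, vocab_size)       # stand-in for model outputs
    tokens = torch.randint(0, vocab_size, (seq_len,))

    # Shift by one: the prediction at position t is scored against token t+1.
    loss = F.cross_entropy(logits[:-1], tokens[1:])
    print(loss.item())

Masked language modeling differs only in which positions are scored: selected input tokens are hidden and the loss is computed on those masked positions instead of the shifted sequence.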
Instruction Tuning & RLHF: Aligning with Human Preferences

Instruction tuning is a fundamental concept in training LLMs. Early work focused on fine-tuning LLMs on various publicly available NLP datasets and evaluating their performance on different NLP tasks. More recent work, such as OpenAI's InstructGPT, has been built on human-created instructions and demonstrates success in processing diverse user instructions [3]. Subsequent works like Alpaca and Vicuna have explored open-domain instruction fine-tuning using open-source LLMs. Alpaca, for example, used a dataset of 50k instructions, while Vicuna leveraged 70k user-shared conversations from ShareGPT.com. These efforts have advanced instruction tuning and its applicability in real-world settings.

Another technique, Reinforcement Learning from Human Feedback (RLHF), uses methods from reinforcement learning to optimize language models with human feedback [4]. Its core training process involves pre-training a language model, training a reward model, and fine-tuning the language model with reinforcement learning. The reward model is calibrated with human preferences and generates a scalar reward that represents these preferences. While RLHF is promising, to date it has notable limitations, such as the potential for models to output factually inaccurate text.
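To make the instruction-tuning data format concrete, the sketch below shows an Alpaca-style training record and prompt template in Python. The field names follow the published Alpaca format, but instruction datasets vary; the example content is invented for illustration.

    # Illustrative Alpaca-style instruction-tuning record and prompt template.
    example = {
        "instruction": "Classify the sentiment of the sentence.",
        "input": "The onboarding process was painless and quick.",
        "output": "positive",
    }

    PROMPT = (
        "Below is an instruction that describes a task, paired with an input.\n"
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
    )

    # During fine-tuning, the model is trained to generate example["output"]
    # conditioned on the filled-in prompt.
    print(PROMPT.format(**example) + example["output"])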
Types of Data

LLMs are typically trained on extensive datasets primarily composed of textual material from web pages, books, and social media. However, as will be explained in a later section, they can also utilize data from other sources as long as it can be converted to a sequence of tokens with a known vocabulary. Hence LaTeX formulas, musical notes, and programming languages like
Python, Java, and C++ may all be adopted as training data [5]-[7]. This enables the model to generate novel mathematical or physical formulas, reason with them, compose music, and generate code to address bugs and enhance program efficiency, thereby streamlining the development process. Additionally, LLMs can leverage SMILES or SELFIES chemical structures for drug design, DNA or protein sequences for predicting protein structures, or genetic mutations related to diseases [8]-[11].
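The common denominator is tokenization: anything expressible as a string can be mapped to token ids. As a minimal sketch, a standard text tokenizer (GPT-2's, via Hugging Face) happily splits a SMILES string or a line of code into subword tokens; note that domain-specific models often use purpose-built vocabularies instead.

    # Sketch: non-natural-language data reduced to token sequences.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    print(tok.tokenize("CC(=O)Oc1ccccc1C(=O)O"))      # aspirin, as a SMILES string
    print(tok.tokenize("def add(a, b): return a + b"))  # Python source code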
The scope extends further to encompass various other modalities like audio, video, signal data (such as wireless network signals or depth sensing signals) [12]-[16], relational or graph database data (such as stock prices or knowledge graphs) [17],[18], as well as digital signatures and file bytes (such as blockchain transactions or image file bytes) [19],[20]. This huge range of usable data sources allows the models to perform tasks such as speech recognition, action recognition, video summarization, robotic movement planning, knowledge graph completion, stock price prediction, blockchain transaction or wireless network transmission anomaly detection, and image classification. While training models on diverse data types can pose challenges related to pre-processing and standardization, it offers significant benefits as it can unlock new applications and solutions across various domains. The ability to process and generate sequential data from multiple modalities expands the potential impact and use cases of LLMs, fostering innovation and problem-solving in numerous fields (Figure 4).
Figure 4. Sample data modalities (text, image, audio, video, 3D, code, genomics, chemical structures, and more) and data sources (webpages such as Wikipedia or GitHub; databases such as financial, virus, or drug data; sensor data such as depth or distance; books; social media such as Instagram, TikTok, YouTube, or Twitter; and more) involved in recent large language models. Note that both the types of data modalities and the types of data sources are continuously increasing.
Large Language Models as Foundation Models
LLMs possess the remarkable ability to
generalize knowledge across diverse
contexts, aligning them closely with
the concept of foundation models
[21],[22]. Foundation models capture
relevant information as a versatile
"foundation" for various purposes,
distinguishing them from traditional
approaches. They demonstrate the
characteristic of emergence, with
behaviors implicitly induced rather than
explicitly constructed. LLMs excel in
solving diverse tasks that go beyond
their original language modeling
training [23],[24]. These tasks can
be accomplished just using natural
language prompts, without the need for
explicit training. This in-context learning
capability allows LLMs to perform
tasks such as machine translation,
arithmetic, code generation, answering
questions, and more [25],[26]. In a
zero-shot learning scenario, the model
relies solely on the task descriptions
given in the prompt [27]-[30], while in
a few-shot learning scenario, a small
number of correct answer samples are
incorporated into the prompts [31]-[33].
Meanwhile, the use of chain-of-thought
(CoT) prompting, which provides
step-by-step instructions to guide the
model's answer generation, has been
shown to boost the model's reasoning
capabilities and overall performance
[34]-[36]. These highlight the generality
and adaptability of LLMs as foundation
models.
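These in-context learning styles are just differently shaped prompts. A minimal sketch, with invented wording (the few-shot pairs follow the style popularized by the GPT-3 paper):

    # Illustrative prompt styles for in-context learning (plain strings; no training).
    zero_shot = "Translate to German: 'The invoice is overdue.'"

    few_shot = (
        "Translate to German.\n"
        "sea otter => Seeotter\n"
        "peppermint => Pfefferminze\n"
        "cheese =>"
    )

    chain_of_thought = (
        "Q: A warehouse has 14 pallets and ships 5, then receives 3. "
        "How many pallets remain?\n"
        "A: Let's think step by step."
    )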
Homogenization is another key characteristic of foundation models and refers to the unifying and consolidating of methodologies across modeling approaches, research fields, and modalities [21]. For example, model architectures such as BERT, RoBERTa, GPT, and others have been adopted as the base architecture for most state-of-the-art NLP models. This trend extends beyond the field of natural language processing, with similar transformer-based approaches being applied in diverse domains such as DNA sequencing and chemical molecule generation. In addition, based on similar principles, foundation models may be built across modalities. Multimodal models, which combine data in the form of text, audio, images, etc., offer a valuable fusion of information for tasks spanning multiple modes. This convergence of methodologies and models has streamlined disparate techniques, leveraging the power of transformers as a core component. Homogenization has facilitated cross-field research, enabling LLMs to excel in diverse applications such as drug discovery, robotic reasoning, and media generation. Foundation models provide a base of generalized knowledge that transcends specific tasks and domains, revolutionizing the generative AI landscape.
To summarize, LLMs are powerful neural network algorithms in the field of natural language processing. Key techniques used in LLMs include the transformer architecture, pre-training, instruction tuning, and RLHF. LLMs are trained on massive amounts of data gathered from a huge range of sources and modalities. As foundation models, they are proficient at generalizing knowledge from vast amounts of text and show zero- or few-shot learning capabilities as well as impressive reasoning skills, particularly when combined with techniques like chain-of-thought prompting. LLMs can accurately complete a wide range of tasks, including understanding language, generating text, and handling diverse types of sequences. Understanding the techniques, architectures, and types of data used in LLMs, as well as their characteristics as foundation models, is essential for navigating the current and future landscape of generative AI.
Beyond the Fundamentals: Key Trends That Shape the Future

In the ever-evolving realm of LLMs, several key trends have emerged to resolve previous inadequacies such as heavy costs, hallucinations, and reasoning fallacies. These limitations have posed considerable challenges to the industrialization of LLMs. Consequently, research and development related to these trends will play a pivotal role in expanding LLM utilization. The trends go beyond the foundational aspects and provide fresh perspectives on the evolving characteristics of LLMs, unlocking exciting opportunities for exploration and innovation, and laying the groundwork for future advancements.
Efcient Model Architectural Design
A signicant recent advancement in LLM
research pertains to enhancing model
efciency. Efforts have been made to reduce
time and space complexities associated with
LLMs. One such innovation is Receptance
Weighted Key Value (RWKV), which optimizes
model architecture and resource utilization
without compromising performance [37].
Another notable trend relevant to model
architecture design regards techniques
that allow models to efciently handle
longer input sequences (e.g., LongNet [38],
Unlimiformer [39], mLongT5 [40]), thereby
enabling LLMs to process and understand
more comprehensive and context-rich
information at once.
Effective and Precise Dataset Creation

Another burgeoning area of focus is the effective generation of training and instruction-tuning data, leveraging methods such as WizardLM to evolve complex instructions from simple ones, enhancing the speed of data generation as well as the diversity of the content [41]. Other approaches like MiniPile [42] or INGENIOUS [43] aim to achieve competitive performance with a small number of examples. Additionally, the innovative approach of Domain Reweighting with Minimax Optimization (DoReMi) estimates the optimal proportion of language from different domains in a dataset, such that LLMs can better adapt to diverse data sources and enhance their capacity for generalization [44].
Reconsideration of Model Scaling Laws: Bigger ≠ Better

The LLM field has traditionally emphasized a positive correlation between model scale and performance improvement. Yet recent studies challenge this notion by presenting evidence of inverse scaling, whereby increased model size leads to worse task performance [45] (Figure 5). This phenomenon arises due to factors including undesirable patterns in the training data and deviation from a pure next-word prediction task. These findings have sparked a shift in understanding the behavior of larger-scale models and have highlighted the need for careful consideration of training objectives and data selection. Relatedly, exploration of smaller language models (SLMs) [46]-[48] has demonstrated their efficacy in specific tasks such as procedural planning and domain-specific question answering. Approaches like PlaSma focus on equipping SLMs with procedural knowledge and counterfactual planning capabilities, enabling them to rival or surpass the performance of larger models [49]. Similarly, Dr. LLaMA leverages LLMs to enhance SLMs through generative data augmentation, yielding improved performance in domain-specific question-answering tasks [50]. These developments challenge the conventional belief that bigger models are inherently superior and highlight the importance of carefully tailored data and objectives for training language models. By adopting a more nuanced understanding of model scaling laws, researchers and practitioners can harness the potential of smaller as well as larger language models to meet the demands of diverse applications and domains.
Alternative Alignment Approaches

Another focus of current research is how best to align LLMs with human preferences, with the goal of improving model performance and interaction quality. Traditional approaches such as the aforementioned Reinforcement Learning from Human Feedback (RLHF) have relied on optimizing LLMs using reward scores from a human-trained reward model [3],[4]. These approaches have shown effectiveness but come with computational complexity and heavy memory requirements. Recent advancements introduce approaches such as Sequence Likelihood Calibration with Human Feedback [51] and Reward Ranking from Human Feedback (RRHF) [52], which address earlier shortcomings by calibrating a language model's sequence likelihood through ranking of desired versus undesired outputs. Another method, termed Less Is More for Alignment (LIMA) [53], aims to achieve comparable performance without reinforcement learning by more efficiently fine-tuning models on only 1,000 carefully curated prompts and responses. These examples present a simpler and more efficient approach to aligning LLM output probabilities with human preferences, facilitating integration of LLMs into practical applications and enhancing their value.

Figure 5. Larger models may not necessarily perform better for tasks deviating from next-word prediction. FLOPs correspond to the amount of computation consumed during model pre-training, which correlates with model size as well as factors such as training time or data quantity. Training FLOPs are used rather than model size alone because computation is considered a better proxy for model performance in the original paper [45].
Incorporation of Cognitively Inspired Memory Mechanisms

Yet another emerging trend in this field is the incorporation of cognitively inspired memory mechanisms into LLMs, taking inspiration from the current understanding of human memory functioning [54]-[56]. This development aims to improve training efficiency, generalization across tasks, and long-term interaction capabilities. For example, to address the forgetting phenomenon, in which a model's performance on previously completed tasks deteriorates, researchers have proposed Decision Transformers with Memory (DT-Mem), which integrates an internal working memory module into LLMs [57]. By storing, blending, and retrieving information for different tasks, this proposed mechanism enhances training efficiency and generalization. Researchers are also investigating deficiencies of long-term memory in LLMs, referring to models' limited capacity to sustain interactions over extended periods. One proposed solution is MemoryBank, a novel memory mechanism tailored for LLMs [58]. Inspired by the Ebbinghaus Forgetting Curve theory, MemoryBank enables LLMs to summon relevant memories and continuously update their memory based on time elapsed and the significance of the memory. By emulating human memory storage mechanisms and allowing for long-term memory retention, LLMs could overcome the limitations of forgetting and sustain meaningful longer-term interactions.
Magnifying Multimodality

As described earlier, a clear trend in the continuously evolving field of LLMs is the incorporation of more and more modalities and the improvement of multimodal training [14],[36],[59]-[61]. Researchers have developed approaches like ImageBind, which learns a joint embedding across multiple modalities such as images, text, audio, depth, thermal, and inertial measurement unit data, making cross-modal retrieval, composition, detection, and generation possible [62]. ULIP-2, a multimodal pre-training framework, addresses scalability and comprehensiveness issues in gathering multimodal data for 3D understanding by leveraging LLMs to automatically generate holistic language counterparts [63]. It has achieved remarkable improvements in zero-shot classification and real-world benchmarks without manual annotation efforts. Such advancements expand LLM capabilities, enabling models to understand and generate across multiple modalities and perform complex tasks in diverse domains.
From Explainability to Tractability and Controllability

Novel approaches have also been developed to enhance the explainability, tractability, and controllability of LLMs and relevant applications [64]-[67]. For example, Control-GPT leverages the precision of LLMs like GPT-4 in generating code snippets for text-to-image generation [68]. By querying GPT-4 to write graph-generating code and using the generated sketches alongside text instructions, Control-GPT enhances instruction-following and greatly improves the controllability of image generation. Another approach, Backpacks, introduces a neural architecture that combines strong modeling performance with interpretability and control [69]. Backpacks learn multiple sense vectors for each word and represent a word as a context-dependent combination of sense vectors, allowing for interpretable interventions to change the model's behavior. Additionally, GeLaTo proposes using tractable probabilistic models, such as distilled hidden Markov models, to impose lexical constraints in autoregressive text generation [70]. GeLaTo achieves state-of-the-art performance on constrained text generation benchmarks, surpassing strong baselines. Advances like these not only provide insights into the workings of LLMs but also enable greater control and customization, enhancing their performance in computer vision and text generation tasks.
Hallucination Fixes, Knowledge Augmentation, Grounding, and Continual Learning

One of the most prominent trends in recent research is the concerted effort to tackle hallucination and factual inaccuracy, two major stumbling blocks to LLM industrialization [71]-[74]. Researchers have pursued multiple approaches to tackle these problems [75]-[86]. One approach involves analyzing and mitigating self-contradictions in LLM-generated text by designing frameworks that constrain LLMs to generate appropriate sentence pairs [87]. Another aims to enhance the factual correctness and verifiability of LLMs by enabling them to generate text with citations [88]. This involves building benchmarks for citation evaluation and developing metrics that correlate with human judgment.

Additionally, researchers have introduced frameworks that augment LLMs with structured or graph knowledge bases (‘grounding’) to improve factual correctness and reduce hallucination. One approach, Chain of Knowledge (CoK), incorporates structured knowledge bases that provide accurate facts and reduce hallucination [89]. Another technique, Parametric Knowledge Guiding (PKG) [84], equips LLMs with a knowledge-guiding module that accesses relevant knowledge at runtime without modifying the model's parameters. These advances in hallucination avoidance, knowledge augmentation, grounding, and continual learning contribute to improving the reliability and accuracy of generated text across domains and tasks.
Human-like Reasoning and Problem Solving
This trend focuses on enhancing the
reasoning ability of LLMs [90][97].
Researchers have introduced innovative
frameworks such as Tree of Thoughts
(ToT), which enable LLMs to explore and
strategically plan intermediate steps toward
problem-solving [98]. This approach
encourages LLMs to make deliberate
decisions, evaluate choices, and consider
multiple reasoning paths, rather than just
a single one. Another proposed method,
Self-Notes, allows LLMs to deviate from the
input context, enhancing context memory
and enabling multi-step reasoning [99].
Additionally, OlaGPT introduces a framework
to simulate human cognitive abilities,
including attention, memory, reasoning, and
learning [100]. OlaGPT incorporates an active
learning mechanism to strengthen problem-
solving abilities by recording and referring
to previous mistakes and expert opinions.
These developments in reasoning abilities
pave the way for LLMs to tackle complex
problems more effectively, bridging the gaps
between their current capabilities and human
reasoning.
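As a rough illustration of the Tree of Thoughts idea [98], the sketch below runs a small beam search over candidate intermediate "thoughts", keeping only the highest-scored partial paths at each level. The `propose` and `score` functions are our placeholder heuristics; in the actual method, both roles are played by LLM calls.

```python
# Toy beam search over "thoughts"; propose() and score() are placeholders
# for the LLM calls used in the actual method [98].

def propose(state: str) -> list[str]:
    """Placeholder: an LLM would propose candidate next thoughts here."""
    return [state + step for step in ("a", "b", "c")]

def score(state: str) -> float:
    """Placeholder: an LLM would rate how promising a partial path is."""
    return -len(set(state))  # toy heuristic: prefer repeated letters

def tree_of_thoughts(root: str, depth: int = 3, beam: int = 2) -> str:
    frontier = [root]
    for _ in range(depth):
        candidates = [s for state in frontier for s in propose(state)]
        # Keep only the `beam` highest-scored partial solutions per level.
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return frontier[0]

print(tree_of_thoughts(""))  # explores 3 levels, keeping 2 branches each
```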
LLM-guided Artificial General Intelligence
Researchers have also recently endeavored
to develop artificial general intelligence
on top of LLMs [101]-[103]. Voyager, an
embodied lifelong learning agent powered
by LLMs, autonomously explores and acquires
skills in Minecraft without human intervention
[104]. It uses an automatic curriculum, an
ever-growing skill library, and an iterative
prompting mechanism to enhance its
abilities. Voyager demonstrates exceptional
prociency in Minecraft, outperforming
prior state-of-the-art methods on various
metrics. Another approach, LLMs as Tool
Makers (LATM), allows LLMs to create their
own reusable tools for problem-solving,
eliminating dependency on existing tools
[105]. LATM consists of two phases, tool
making and tool using, which together enable
LLMs to generate tools for different tasks
and achieve cost effectiveness. LATM has
been validated across complex reasoning
tasks. Additionally, Augmenting Autotelic
Agents with Large Language Models (LMA3)
introduces a language model augmented
autotelic agent that leverages a pre-trained
language model to represent, generate, and
learn diverse and abstract human-relevant
goals [106]. LMA3 demonstrates the ability
to learn a wide range of skills without hand-
coded goal representations or curricula in a
text-based environment. Such innovations
promote the development of artificial
general intelligence by empowering LLMs to
autonomously acquire skills, create tools, and
pursue diverse goals.
3.1.2. Major Closed-source Models and Open-source Alternatives
In the landscape of LLMs, there are several
major closed-source (often proprietary)
models and a growing number of open-
source alternatives that offer powerful
capabilities for various natural language
processing tasks. These models have been
developed by leading industry players, open-
source developers, and research institutions,
and they continue to push the boundaries of
what LLMs can achieve. In this section, we
will explore some prominent closed-source
models and the growing area of open-source
alternatives.
Closed-source Models
Prior to GPT-3, most LLMs were openly
available. However, with GPT-3 and
similar models that excel in next-word
prediction, there has been a shift towards
proprietary closed-source models. These
are predominantly developed by major
industry players such as OpenAI, Google, and
Microsoft. Table 1 presents a selected list of
these models.
ChatGPT is often considered a service
rather than a standalone model as it
incorporates GPT-3.5 or GPT-4 (for the Plus
version). Likewise, Google's experimental
conversational AI service Bard initially utilized
a lightweight and optimized version of LaMDA
(Language Model for Dialogue Application)
but later transitioned to a more advanced
language model called PaLM 2. Bing Chat,
powered by a customized version of OpenAI's
ChatGPT, integrates Microsoft's search
engine to deliver human-like conversational
responses and improve overall user
experience. Another commercial chatbot,
ERNIE Bot, is built upon Ernie 3.0-Titan. These
conversational AI services, not listed in Table
1, build upon proprietary closed-source
models to provide contextually relevant
conversations and deliver engaging user
experience.
Country Developer & Provider Model Parameters Release
US OpenAI GPT-3 175B Jun 2020
US OpenAI InstructGPT 1.3B, 6B, 175B Jan 2022
US OpenAI GPT-3.5 175B Mar 2022
US OpenAI GPT-4 Unknown Mar 2023
US Microsoft phi-1 1.3B Jun 2023
US Google LaMDA 137B May 2021
US Google GLaM 1.2T Dec 2021
US Google PaLM 540B Apr 2022
US Google PaLM-E 562B Mar 2023
US Google PaLM-2 340B May 2023
US/UK Google DeepMind Gopher 280B Dec 2021
US/UK Google DeepMind Chinchilla 70B Mar 2022
US Amazon AlexaTM 20B Aug 2022
US NVIDIA Megatron Turing NLG 530B Oct 2021
US Bloomberg BloombergGPT 50B Mar 2023
US Anthropic Claude 52B Dec 2021
US Anthropic Claude 2 Unknown Jul 2023
US Cohere Cohere Unknown Nov 2021
China Baidu & Peng Cheng Lab. ERNIE 3.0 Titan 260B Dec 2021
China Beijing Academy of Artificial Intelligence Wu Dao 2.0 175B May 2021
China Huawei PanGu-Σ 1T Mar 2023
Israel AI21 Jurassic-1 178B Sep 2021
Israel AI21 Jurassic-2 Unknown Mar 2023
South Korea Naver Corp HyperCLOVA 204B May 2021
Germany Aleph Alpha Luminous 13B, 30B, 70B Nov 2021
Table 1. Selected list of closed-source models after 2020.
Open-Source Alternatives
In the rst half of 2023 especially, there has
been a surge of open-source LLMs, paving
the way for fresh avenues of innovation and
collaboration. In the early stages of this surge,
the open-source landscape consisted mostly
of research-only models such as LLaMA,
Alpaca, and their subsequent iterations,
including Dolly 1.0, GPT4All, GALPACA,
Baize, Koala, Vicuna, LLaVA, WizardLM,
StableVicuna, ImageBind, etc. These models
allowed researchers to study and explore the
capabilities and potentials of LLMs (Figure 6
and Table 2).
Country Developer & Provider Model Parameters Release
US Meta AI OPT-175B 12M-175B May 2022
US Meta AI LLaMA 7B-65B Feb 2023
US Meta AI ImageBind Unknown May 2023
US, China Microsoft, Peking U. WizardLM 7B-65B Apr 2023
US Microsoft Orca 13B Jun 2023
US Stanford University Alpaca 7B Mar 2023
US Georgia Tech Research Institute GALPACA 6.7B, 30B Apr 2023
US, China University of California San Diego, Sun Yat-sen University, Microsoft Research Asia Baize 7B-30B Apr 2023
US UC Berkeley Koala 13B Apr 2023
US University of Wisconsin-Madison, Microsoft Research, Columbia University LLaVA 13B Apr 2023
US Databricks Dolly 1.0 6B Mar 2023
US Nomic AI GPT4All 7B-13B Mar 2023
US LMSYS Org Vicuna 13B Apr 2023
US CarperAI StableVicuna 13B Apr 2023
Singapore National University of Singapore Goat 7B May 2023
France BigScience Bloom 176B Nov 2022
Various OpenOrca OpenOrca-Preview1-13B 13B Jul 2023
Table 2. Selected list of open-source non-commercial models.
Figure 6. Major large language models released between February and July 2023, grouped into closed-source, open-source (research-only), and open-source (commercial) models.
Country Developer & Provider Model Parameters Release
US EleutherAI GPT-J-6B 6B Jun 2021
US EleutherAI GPT-NeoX-20B 20B Apr 2022
US Google UL2 20B May 2022
US Google Flan T5 80M-11B Oct 2022
US Google Flan UL2 20B Mar 2023
US Cerebras Cerebras-GPT 111M-13B Mar 2023
US Nomic AI GPT4All-J 6B Apr 2023
US EleutherAI Pythia 70M-12B Apr 2023
US Databricks Dolly 2.0 3B-12B Apr 2023
US H2O.ai h2oGPT 12B Apr 2023
US LMSYS Org FastChat-T5 3B Apr 2023
US AI Squared Dlite V2 124M-1.5B May 2023
US, Spain, UK RWKV Foundation, EleutherAI, University of Barcelona, Charm Therapeutics, Ohio State University RWKV 169M-14B May 2023
US MosaicML MPT-7B, 30B 7B, 30B May-Jun 2023
US Together RedPajama-INCITE 3B, 7B May 2023
US OpenLM Research, Stability AI OpenLLaMA 3B, 7B, 13B May-Jun 2023
US Meta AI Llama 2 7B-70B Jul 2023
UK Stability AI StableLM-Alpha 3B-65B Apr 2023
Germany LAION AI Open Assistant (Pythia family) 12B Apr 2023
UAE Technology Innovation Institute Falcon 7B, 40B May 2023
China Baichuan baichuan-7B 7B Jun 2023
Table 3. Selected list of open-source large language models that allow potential commercial usage.
The open-source landscape has expanded
considerably since then, with many models
and datasets emerging that allow potential
commercial usage.¹ Notable among these
models are Cerebras-GPT, Pythia, Dolly
2.0, GPT4All-J, OpenAssistant, StableLM,
h2oGPT, OpenLLaMA, OpenAlpaca, MPT-
7B, and RedPajama-INCITE. Together with
open-source models, open-source datasets
like RedPajama-Data-1T and StarCoderData
have been generated and their data curation
methods published, further widening the
possibilities for commercial applications.
These models and datasets aim to provide
accessible and customizable alternatives to
proprietary commercial options (Tables 3-5).
Overall, these open-source models
have helped create fertile ground for
experimentation, enabling researchers,
developers, and practitioners to explore novel
applications and collaborate on advancing
the eld of generative AI without costly
licensing agreements. The availability of
these models has democratized access to
sophisticated generative language modeling
techniques, promoting a more inclusive
and vibrant AI development ecosystem.
Developers, researchers, and practitioners
now have the opportunity to leverage and
contribute to these open-source models,
driving transformative breakthroughs in
diverse applications.
¹ The models and datasets mentioned in the tables in this section have been curated based on various sources including providers' official
announcements and Github repositories, Hugging Face model cards (https://huggingface.co/models) and open-source knowledge
graphs or lists such as the Stanford foundation models ecosystem graph (https://crfm.stanford.edu/) and the Open LLMs Github repository
(https://github.com/eugeneyan/open-llms). While these tables serve as a starting point for readers to explore commercially usable
models and datasets, it is important to note that licenses for model weights, source codes, or datasets may vary across different branches
and downstream products and may be subject to changes across different versions as they evolve. Also, the associated permissive licenses
(e.g., CC BY-SA-4.0, Apache 2.0, BSD-3-Clause, MIT, OpenRAIL-M v1) may have different nuances concerning liability, warranty, patent use,
copyright, etc. As a best practice, readers should always verify with the original providers the up-to-date licensing conditions of the models
as well as those of the associated model weights, source codes, and datasets before engaging in extensive development and launching of
commercial usage.
Country Developer & Provider Dataset Size Release
US EleutherAI The Pile 825GB Dec 2020
US Anthropic Helpful and Harmless 79.3MB Apr 2022
US Together RedPajama-Data-1T 5TB Apr 2023
US Databricks databricks-dolly-15k 13.1MB Apr 2023
France/US Hugging Face & ServiceNow (BigCode Project) The Stack 6TB Nov 2022-Feb 2023
France/US Hugging Face & ServiceNow (BigCode Project) StarCoderData 882GB May 2023
Germany LAION AI LAION-5B 11.4TB Jun 2022
Germany LAION AI OIG Dataset 44M Mar 2023
Germany LAION AI OASST1 (OpenAssistant Conversations Dataset) 41.6MB Apr 2023
UK University College London MiniPile 6GB Apr 2023
Table 5. Selected list of open-source datasets that allow potential commercial usage.
Country Developer & Provider Model Parameters Release
US Replit Replit Code 2.7B May 2023
US Salesforce Research CodeGen2 16B May 2023
US Salesforce Research CodeT5+ 16B May 2023
US Salesforce Research Xgen-7B-4K/8K-Base 7B Jun 2023
US Salesforce Research CodeGen2.5-7B-mono/multi 7B Jul 2023
France/US Hugging Face & ServiceNow (BigCode Project) SantaCoder 1.1B Jan 2023
France/US Hugging Face & ServiceNow (BigCode Project) StarCoder 15B, 1B, 3B, 7B May, Jul 2023
France/US Hugging Face & ServiceNow (BigCode Project) StarChat Alpha 16B May 2023
Table 4. Selected list of open-source code-oriented large language models that allow potential commercial
usage.
Q: Could you share your insights on the potential long-term impacts of increasingly advanced open-source LLMs on the European industry?
A: “The emergence of increasingly advanced open-source LLMs can have significant long-term impact on the European industry. It offers opportunities for European companies to leverage and build upon these models to develop innovative AI solutions. In contrast to proprietary (non-European) offerings, this reduces dependence on foreign technology, strengthens intellectual property, and facilitates regulatory compliance.”
- Dr. Stephan Meyer, Head of Artificial Intelligence, Munich Re Group
In short, the current LLM landscape includes
major closed-source models and an expanding
range of open-source alternatives. These
models offer powerful capabilities for various
NLP tasks including chatbots, knowledge-base
question-answering, language generation,
and more. Closed-source, commercial
models provide robust solutions backed by
industry expertise, whereas the recent open-
source explosion has led to the emergence
of research-only models and commercially
usable alternatives, allowing for increased
collaboration and innovation. Striking a balance
between closed-source and open-source
models may help organizations benefit from
the respective strengths of these different
types of models and drive LLM progress.
Best of Both Worlds
This growing open-source ecosystem
offers the possibility of a balance between
closed-source and open-source models. In a
recent paper on the concept of ‘FrugalGPT’,
for example, the authors put forward the idea
of integrating different types of LLMs to
optimize costs and achieve better outcomes
(Figure 7; [107]). By embracing both closed-
source models and open-source alternatives,
organizations can have the best of both
worlds, for example, by leveraging GPT-4 as
a high-level reasoning and planning engine
and then using open-source models to
complete specic tasks in contexts where
their performance excels.
Figure 7. FrugalGPT demonstrates how to use large
language models while reducing cost and improving
performance [107].
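A minimal sketch of such a cascade, in the spirit of FrugalGPT [107], follows: cheaper models are queried first, and a request escalates to a more capable (and more expensive) model only when a confidence score falls below a threshold. The model names, confidence values, and threshold rule are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative LLM cascade (assumed interfaces, not FrugalGPT's exact method).

def call_model(name: str, query: str) -> tuple[str, float]:
    """Placeholder for an API call returning (answer, confidence in [0, 1])."""
    fake_confidence = {"open-7b": 0.55, "open-70b": 0.80, "gpt-4": 0.99}
    return f"<{name} answer to {query!r}>", fake_confidence[name]

CASCADE = ["open-7b", "open-70b", "gpt-4"]  # ordered cheapest to priciest

def answer(query: str, threshold: float = 0.75) -> str:
    for model in CASCADE:
        response, confidence = call_model(model, query)
        # Accept the first sufficiently confident answer; the most capable
        # (and most expensive) model is the fallback of last resort.
        if confidence >= threshold or model == CASCADE[-1]:
            return response

print(answer("Summarize the key risks in our supplier contracts."))
```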
3.1.3. Flourishing Large Language Model Applications, Extensions, and
Relevant Frameworks
With the recent surge in enthusiasm around
generative AI sparked by the launch of
ChatGPT, there has been a rapid expansion
in the number of possible applications,
extensions, and frameworks that center
around LLMs. These developments open
up myriad new possibilities and pave the
way for transformative advances. In this
section, we explore some of the major recent
developments in this exciting era.
Agentic AI
The term Agentic AI refers to artificial
intelligence systems that can make
autonomous decisions and take proactive
action based on their understanding of a
given situation. The emergence of LLMs
has accelerated the development of
agentic AI as such models can act as a
reasoning engine or a core controller for
intelligent planning and execution behavior.
Applications such as AgentGPT, AgentLLM,
Transformers Agent, Langchain Agents, and
Auto-GPT demonstrate the capability to
act as virtual agents, enabling autonomous
decision-making and interaction in
dynamic environments to complete target
tasks. Agentic LLMs have the potential to
revolutionize fields such as customer service,
virtual assistants, and autonomous systems.
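The control loop at the heart of such agents can be sketched in a few lines: an LLM is repeatedly asked to choose the next tool call given the goal and the accumulated observations, and the loop ends when it signals completion. The tool set and the `plan_next_action` stub below are simplified assumptions about how frameworks of this kind operate, not code from any of the projects named above.

```python
# Simplified agent loop (our assumptions, not any framework's actual code).

TOOLS = {
    "search": lambda arg: f"<search results for {arg!r}>",
    "calculate": lambda arg: str(eval(arg, {"__builtins__": {}})),
}

def plan_next_action(goal: str, history: list[str]) -> tuple[str, str]:
    """Placeholder for an LLM call choosing (tool_name, tool_argument)."""
    if not history:
        return "search", goal
    return "finish", "done"  # toy policy: stop after one tool call

def run_agent(goal: str, max_steps: int = 5) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):
        tool, arg = plan_next_action(goal, history)
        if tool == "finish":
            break
        observation = TOOLS[tool](arg)  # execute the chosen tool
        history.append(f"{tool}({arg!r}) -> {observation}")
    return history

print(run_agent("find recent PEFT papers"))
```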
Coding Assistants and Coding-oriented Models
Another area of significant development
is that of coding assistants and relevant
models, which aim to enhance software
development workflows. GitHub Copilot,
StarCoder/StarChat, CodeGen2 [108], and
CodeT5+ [109] are prominent examples.
These coding assistants and models leverage
LLMs to provide intelligent code suggestions
and evaluations, generate code snippets
or comments, and assist developers in
improving and optimizing the quality and
efciency of code. By automating repetitive
tasks and offering intelligent guidance, these
tools boost developer productivity and
facilitate rapid prototyping.
LLM Programming
The development of LLM programming
techniques has enabled the creation of novel
frameworks to interact with LLMs. LMQL and
Low-code LLM [110] are examples of tools
that either allow developers to interweave
prompts with a control flow (e.g., loops)
to increase flexibility and reusability of the
prompts, or that incorporate simple low-
code visual programming interactions to
effectively utilize LLMs for complex tasks.
This approach facilitates more controllable
and stable LLM responses, making it easier to
build applications and automate tasks.
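The sketch below conveys the flavor of interweaving prompts with control flow: a plain Python loop repeatedly queries a (stubbed) model, feeding earlier answers into later prompts to steer generation. Note that this is ordinary Python around a hypothetical `llm` call, not LMQL's actual query syntax.

```python
# Plain-Python sketch of prompt + control flow (not LMQL's real syntax).

def llm(prompt: str) -> str:
    """Placeholder for a constrained LLM call."""
    return f"<item for {prompt!r}>"

def brainstorm(topic: str, n: int = 3) -> list[str]:
    items: list[str] = []
    for _ in range(n):
        # Each iteration's prompt embeds all previously generated items,
        # steering the model away from repeating itself.
        prompt = (
            f"List one new risk of {topic}, different from: {items or 'none'}"
        )
        items.append(llm(prompt))
    return items

print(brainstorm("deploying LLMs in finance"))
```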
LLM-powered Document Analyzers
LLMs have also been applied to document
analysis tasks, leading to the development of
assistants such as Arches AI, PDF GPT, and
HUMANTA. These tools leverage the power of
LLMs to assist with tasks such as document
summarization, information extraction, and
context-aware analysis. By automating these
processes, document analysis assistants
can streamline workows and improve
productivity in industries including legal,
finance, and research.
LLM-powered Chatbots and Playgrounds
Advances in LLMs have spurred the
development of chatbots and playgrounds
that facilitate interactive and engaging
conversations. OpenAssistant, Vercel.ai
playground, Chatbot Arena, h2oGPT, and
HuggingChat are notable examples in this
domain. These platforms allow users to
interact with LLM-powered chatbots, explore
creative dialogues, and even develop their
own conversational agents.
LLM-powered Domain-Specific Assistants
LLMs have also been customized for specific
industries, giving rise to domain-specific
assistants. FinChat.io, for instance, caters
specifically to the finance industry, providing
intelligent support for tasks like financial
analysis, investment recommendations, and
risk assessment. These assistants leverage
domain-specific knowledge and language to
deliver tailored solutions to industry-specific
challenges.
Model Training, Fine-tuning, and Management
Platforms
As the complexity of LLMs increases,
there is growing demand for efficient
and standardized model training, fine-
tuning, and management platforms. It
is becoming essential to have robust
platforms that facilitate the entire lifecycle
of these models. Platforms like H2O LLM
Studio, DeepSpeed ZeRO++, NVIDIA NeMo
Framework, and MosaicML offer unified
solutions for model training, fine-tuning,
deployment, and monitoring. By providing
predefined workflows, tools, and resources,
these platforms simplify the process of
customizing, fine-tuning, and deploying LLMs,
allowing users to leverage the knowledge
encoded within the models while being
able to tailor models to specific domains or
applications and maintain them with minimal
difficulty.
Model Compilation and Quantization
Frameworks
With larger and more resource-intensive
LLMs comes the increasing need for model
compilation and quantization frameworks.
WebLLM and similar tools provide methods
for optimizing and compiling LLMs to
reduce their memory footprint and
improve efciency. These frameworks
enable deployment of LLMs on resource-
constrained devices, running inferences even
from a web browser, creating the potential for
private, on-device language processing and
real-time personal applications.
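To illustrate what quantization does at the tensor level, the following sketch rounds 32-bit float weights to 8-bit integers with a single per-tensor scale and then dequantizes them. Production frameworks use considerably more sophisticated schemes (per-channel scales, 4-bit formats, calibration data), so this is purely a conceptual example.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
print("memory: 4 bytes/weight -> 1 byte/weight (plus one scale)")
```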
Other LLM-based Applications
Beyond these categories, a vast array of LLM-
based tools and applications is on the horizon.
These include marketing tools for sentiment
analysis and content generation, knowledge
organization platforms for information
retrieval and knowledge discovery, text-to-
image/video generation models for creative
content production, music generation
models for composition and harmonization,
data analysis frameworks for language-driven
insights, voice generation models for natural
and expressive speech synthesis, gaming
applications for interactive storytelling,
and even pharmaceutical applications for
molecular compound finding, drug discovery,
and protein design.
The diverse range of LLM applications,
extensions, and frameworks highlights the
versatility and potential of LLMs to address
complex challenges across industries.
By harnessing the power of language
understanding and generation, organizations
can unlock new opportunities for automation,
innovation, and user experience. As the LLM
landscape continues to evolve, it is crucial
for businesses to stay informed about the
latest advances and consider how these tools
could be leveraged to drive their own digital
transformation and competitive advantage.
Q: In your view, which area of industrial LLM applications shows the most promise for the near future?
A: “In the future, I envision employees seamlessly collaborating with specialized AI assistants to efficiently address daily internal tasks or inquiries by customers. These AI assistants will adeptly access, extract, and integrate relevant knowledge, offering recommended solutions and providing detailed, step-by-step guidance to execute processes effectively.”
- Bernhard Pflugfelder, Head of Use Cases and Applications, appliedAI Initiative GmbH
3.2. Domain-Specific Application of Large Language Models in Industrial Scenarios
3.2.1. Fine-tuning and Adaptation from a Technical Perspective: To What Extent Are They Needed and How Could They Help?
While LLMs are trained on vast amounts of
general text data, they can sometimes lack
the necessary knowledge for specialized
applications. In such cases, fine-tuning
models on a smaller, specific dataset can
significantly improve performance in that
area. Such an approach allows organizations
to adapt pre-trained models to their specific
needs, and subsequently improve accuracy,
relevance, and efficiency [111]. In this section,
we will explore some fundamental concepts
of ne-tuning and adaptation, before
discussing key trends in the development
of these techniques that could expand the
potential of LLMs even further.
Fundamentals
The terms ne-tuning and adaptation are
often used interchangeably to describe
closely related techniques used to further
train a pre-trained LLM using domain-specic
data. The objective is to enable the model
to learn specic patterns and nuances
unique to the target domain while preserving
the core knowledge acquired during pre-
training. Fine-tuning is particularly effective
when dealing with narrow domains that
have limited annotated data available for
training. By exposing the model to domain-
specic data, it can adapt to the vocabulary,
style, and distinctive characteristics of the
3.2. Domain-Specic Application of Large Language Models in
Industrial Scenarios
3.2.1. Fine-tuning and Adaptation from a Technical Perspective: To What
Extent Are They Needed and How Could They Help?
domain. These approaches result in improved
performance and better alignment with
specic task requirements, enhancing the
applicability of LLMs in domain-specic,
specialized industrial scenarios.
Full-scale ne-tuning of LLMs poses
computational challenges due to the
extensive number of parameters involved
and requires sufciently large dedicated
hardware resources. Various parameter-
efcient ne-tuning (PEFT) approaches
have been developed to address this. Such
approaches employ techniques such as
modication of model input and insertion of
trainable parameters into different parts of
the model architecture, including hard/soft
prompt tuning [112], prex-tuning [113], and
adapter-based tuning (e.g., neural adapters
[114], LoRA [115], LLaMA-adapter [116], [117]).
Hard prompt tuning methods modify discrete
model input tokens to guide the model's
output. Soft prompt tuning optimizes
continuous feature vectors derived from the
discrete token input layer using gradient-
based methods. Techniques involving
inserted parameters (prefix-tuning and
adapter-based tuning) typically encapsulate
them within simple modules that facilitate
the language model's adaptation to target
domains or tasks. These added modules
possess desirable characteristics such as
simplicity with a small parameter count,
extensibility to the original language models,
and exibility for sequential training on
specic domains. By integrating these
additional parameters into different parts of
the existing LLM architecture, task-specic
learning can be achieved, allowing models to
be customized for specic tasks or domains.
These parameter-efficient fine-tuning
approaches aim to strike a balance between
model performance and computational
resources, such that LLMs can be tailored to
meet specific requirements without the need
for extensive computational infrastructure
(Figure 8).
Figure 8. Types of parameter-efficient fine-tuning (PEFT): prompt-based tuning (hard and soft prompt tuning) and tuning of inserted parameters (prefix-tuning and adapter-based tuning).
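As a concrete example of adapter-based tuning, the following PyTorch sketch captures the core idea of LoRA [115]: the pre-trained weight matrix is frozen and a trainable low-rank update BA is added in parallel, so only a small fraction of the parameters is ever trained. The rank and scaling values are illustrative defaults, not prescribed settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where only A and B are trained."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        # B starts at zero, so training begins from the unmodified model.
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} of {total} parameters")  # ~2% in this toy case
```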
Making a Decision to Fine-tune
The decision to pursue fine-tuning of LLMs
or use out-of-the-box models depends on
factors such as data privacy, information
security, budgets, make-or-buy and vendor
strategy, and requirements for model
diversity. Empirical evaluation results also
play a role. Fine-tuning may be preferable
when dealing with domain-specific tasks
that require a high level of customization
and performance optimization or when the
out-of-the-box model's performance is
unsatisfactory, often due to insufficient
exposure to domain-relevant data during
pre-training. Out-of-the-box LLMs are more
suitable for general-purpose applications
or when the task aligns well with the pre-
existing knowledge encoded in the models.
These models can offer convenient and
efcient solutions without the need for
extensive ne-tuning.
Benets of Fine-tuning and Adaptation from a
Technical Perspective
The benets of ne-tuning and adaptation
are evident when it comes to addressing
the specic challenges and requirements of
domain-specic scenarios. By applying these
approaches, organizations can achieve the
following:
1) Improved Performance. Fine-tuning an
LLM with domain-specific data can
significantly improve its performance and
accuracy on specic tasks. The model
becomes more adept at understanding
the intricacies of the target domain,
leading to more reliable and precise
results.
2) Enhanced Relevance. Adaptation allows
LLMs to understand and generate
content in different languages or
professional jargon. This is particularly
valuable in industrial settings where
models need to process and generate
text using appropriate terms to cater to
a specialized user base. Adapting the
model to company-internal language can
help ensure that the generated output is
technically appropriate for that context.
3) Personalization and Tailored Outputs.
Customization enables companies
to create models that align closely
with their specific business needs. By
incorporating domain-specific data or
organization-specific criteria, models
can generate outputs that are highly
relevant, personalized, and aligned
with the organization's objectives. This
level of customization enhances user
experience and enables more effective
communication with customers or users.
Challenges of Fine-tuning from a Technical
Perspective
Despite these benets, there are limitations
and challenges associated with these
approaches. Fine-tuning and adaptation
require carefully curated datasets that
accurately represent the target domain,
language, or business context. Obtaining
high-quality and representative data can be
a challenge, especially in niche or specialized
domains where labeled data may be scarce.
Additionally, ne-tuning, adaptation, and
customization require expertise in machine
learning and NLP techniques, as well as
sufcient computational resources to train
and deploy models effectively.
Fine-tuning, adaptation, and customization
offer advantages when it comes to leveraging
the potential of LLMs in domain-specific
scenarios. These approaches enable
organizations to tailor models to their specific
needs, resulting in improved performance,
relevance, and personalization. A caveat is
that careful consideration must be given to
the availability of high-quality data, expertise,
and computational resources required for
implementation. By understanding and utilizing
these techniques, organizations can unlock the
full potential of LLMs and drive innovation in
their respective industries.
Beyond the Fundamentals: Key Trends That Shape the Future
As with modeling techniques, there are
notable trends emerging in fine-tuning,
adaptation, and customization methods.
These trends go beyond foundational
concepts and provide fresh perspectives on
the ever-evolving landscape of LLMs. They
present exciting opportunities for further
exploration and innovation, pushing the
boundaries of what can be achieved.
Low-cost and Efficient Fine-tuning
Researchers are investigating methods to
fine-tune LLMs more efficiently [118], [119],
with several innovative approaches showing
promise. QLoRA, for example, implements
LoRA on quantized LLMs and has reached
99.3% of the performance level of ChatGPT
while requiring just 24 hours of fine-tuning
on a single GPU [120]. Another noteworthy
recent technique, the memory-efficient zeroth-
order optimizer (MeZO), addresses the issue
of memory consumption during fine-tuning
[121]. By reducing memory requirements
to the level of inference, MeZO enables
efficient training of 30-billion-parameter
models on a single A100 80GB GPU. These
improvements in fine-tuning efficiency not
only accelerate the adaptation of LLMs to
specific domains or tasks but also resolve
problems with resource constraints and
time-intensive processes. Optimization of
the fine-tuning process can further help
organizations exploit the full potential of LLMs
while reducing the time and computational
resources required.
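The memory saving behind MeZO can be illustrated with a toy NumPy example: the gradient is estimated from two forward passes along a shared random direction, so no backward pass, activations, or optimizer states need to be stored. The quadratic objective below is our toy stand-in for a model's loss; the actual method [121] applies this update in place to LLM parameters, regenerating the random direction from a seed.

```python
import numpy as np

def mezo_step(theta, loss_fn, eps=1e-3, lr=1e-2, seed=0):
    """One zeroth-order (SPSA-style) update: the gradient is estimated from
    two forward passes along a shared random direction, the key memory
    saving in MeZO [121]."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(theta.shape)     # shared perturbation direction
    loss_plus = loss_fn(theta + eps * z)     # forward pass 1
    loss_minus = loss_fn(theta - eps * z)    # forward pass 2
    projected_grad = (loss_plus - loss_minus) / (2 * eps)
    return theta - lr * projected_grad * z

# Toy objective: pull theta toward 3 using only forward evaluations.
theta = np.zeros(5)
for step in range(500):
    theta = mezo_step(theta, lambda t: float(np.sum((t - 3.0) ** 2)),
                      seed=step)
print(theta.round(1))  # moves toward [3. 3. 3. 3. 3.]
```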
Fine-tuning-free Approaches
Another recent trend involves methods that
achieve comparable performance without
fine-tuning. Researchers have, for example,
introduced a mechanism called "distilling
step-by-step" that trains smaller models
using LLM rationales as additional supervision
within a multi-task training framework
[46]. These smaller models achieve better
performance with fewer labeled/unlabeled
training examples and substantially smaller
model sizes, while still outperforming LLMs
on benchmark tasks. Such trends highlight
ongoing efforts to push the boundaries of
LLM capabilities and overcome challenges in
practical applications.
3.2.2. Towards Domain-specific Dynamic Benchmarking Approaches
Benchmarking is a crucial process for
evaluating LLM performance [122]. It
involves measuring and comparing various
metrics to assess models’ capabilities and
limitations. With continued advances in the
field, there is an increasing need for effective
benchmarking. This section focuses on
techniques used to benchmark LLMs and
emphasizes the significance of dynamic
benchmarking.
Traditional Approaches
Benchmarking of LLMs encompasses a
range of methods and metrics. One common
approach is to evaluate models on standard
NLP tasks, such as text classication,
sentiment analysis, machine translation, and
question-answering. These tasks serve as
benchmarks to gauge model performance
and provide a basis for comparison across
different models. Additional metrics such as
accuracy, precision, recall, and F1 score have
also been widely used to quantify model
performance.
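For reference, these standard metrics reduce to a few lines of code. The sketch below computes precision, recall, and F1 from binary predictions; the labels at the end are made-up toy data.

```python
def precision_recall_f1(y_true: list[int], y_pred: list[int]):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: four test items from a sentiment classification benchmark.
print(precision_recall_f1([1, 0, 1, 1], [1, 1, 1, 0]))
```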
Another approach involves using
datasets specically designed to evaluate
performance. These datasets may include
diverse linguistic phenomena such as
syntactic structures, semantic relationships,
and pragmatic understanding. By evaluating
performance on these datasets, researchers
can better understand a model’s capacity to
handle complex linguistic tasks.
The Need for Dynamic Benchmarking
Though traditional benchmarking
approaches remain useful, they are not
without limitations. LLMs are renowned
for their capacity to adapt and enhance
performance over time through continual
learning. Similarly, the content and style
of model input, such as the inclusion of
the latest news reports or research findings, is
likely to evolve as time progresses. These
considerations highlight the inadequacy
of static benchmarks in assessing a
model’s potential. This is where dynamic
benchmarking comes into play [123], [124].
Dynamic benchmarking involves continuously
evaluating and updating benchmarks as the
model evolves. By periodically assessing
the model's performance on new tasks and
datasets, researchers can track progress and
identify areas that require improvement.
Advantages of Dynamic Benchmarking
Dynamic benchmarking offers several
advantages over static benchmarking.
First, it enables researchers to evaluate the
performance of LLMs in real-world scenarios
that evolve over time, taking into account
factors such as ‘domain shift’ or ‘concept
drift’, where the underlying data distribution shifts. As language
evolves, new linguistic phenomena emerge
and models must be able to adapt to these
changes. Dynamic benchmarking allows
researchers to assess a model's ability to
handle these novel and evolving phenomena.
Second, dynamic benchmarking promotes
innovation and drives further research and
development in the field. By continuously
evaluating model performance, researchers
can identify weaknesses and focus on
addressing limitations. This iterative process
encourages development of more advanced
techniques and architectures to improve LLM
performance.
Benchmarking plays a crucial role in evaluating
LLM performance. While traditional static
benchmarks provide valuable insights,
dynamic benchmarking offers a more
comprehensive and timely understanding of
a model's capabilities. It allows researchers
to track a model's progress, identify areas
for improvement, and ensure its suitability
for evolving real-world language challenges.
Dynamic benchmarking promotes innovation,
drives research, and enables strategic
managers and NLP practitioners to make
informed decisions about deploying LLMs in
their respective domains.
Q: Looking into the crystal ball - to what extent will LLMs be integrated into everyday human and corporate activities in 2030?
A: “They will be everywhere. But we talk about 2025 and not 2030.”
- Dr. Andreas Liebl, Managing Director and Founder, appliedAI Initiative GmbH
Third, dynamic benchmarking provides
a more comprehensive and up-to-date
understanding of a model's strengths
and weaknesses. As benchmarks evolve,
researchers learn about a model's
performance across domains, languages,
and tasks. This information is invaluable for
strategic managers and NLP practitioners
who need to assess the suitability of LLMs for
specic applications.
Implementation
To implement dynamic benchmarking,
researchers require access to diverse and
evolving datasets that reflect real-world
language use. These should include domain-
specific data, multilingual data, and data
that captures the nuances and complexities
of natural language. Collaborations with
industry partners, academia, and the wider
NLP community can help gather and curate
datasets to support dynamic benchmarking.
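Structurally, a dynamic benchmarking pipeline can be as simple as the loop below: each round draws the newest curated test items, scores the current model, and appends the result to a history that reveals drift over time. The data-loading and model-call stubs are our assumptions; a real pipeline would plug in a domain-specific evaluation suite.

```python
import datetime

def load_latest_eval_set(domain: str) -> list[tuple[str, str]]:
    """Placeholder: fetch newly curated (question, expected answer) pairs."""
    return [("What is PEFT?", "parameter-efficient fine-tuning")]

def model_answer(question: str) -> str:
    """Placeholder for the deployed model under evaluation."""
    return "parameter-efficient fine-tuning"

def run_benchmark_round(domain: str, history: list[dict]) -> None:
    data = load_latest_eval_set(domain)
    correct = sum(expected.lower() in model_answer(q).lower()
                  for q, expected in data)
    history.append({
        "date": datetime.date.today().isoformat(),
        "domain": domain,
        "accuracy": correct / len(data),
    })

history: list[dict] = []
run_benchmark_round("finance", history)  # schedule e.g. weekly in production
print(history)
```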
References
[1] W. X. Zhao et al., “A Survey of Large Language Models.” arXiv, Jun. 29, 2023. Accessed: Jul. 13, 2023.
[Online]. Available: http://arxiv.org/abs/2303.18223
[2] A. Vaswani et al., “Attention Is All You Need.” arXiv, Dec. 05, 2017. Accessed: Feb. 15, 2023. [Online].
Available: http://arxiv.org/abs/1706.03762
[3] L. Ouyang et al., “Training language models to follow instructions with human feedback.” arXiv, Mar.
04, 2022. Accessed: Feb. 13, 2023. [Online]. Available: http://arxiv.org/abs/2203.02155
[4] P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning
from human preferences.” arXiv, Feb. 17, 2023. Accessed: Jul. 10, 2023. [Online]. Available: http://arxiv.
org/abs/1706.03741
[5] R. Taylor et al., “Galactica: A Large Language Model for Science.” arXiv, Nov. 16, 2022. doi: 10.48550/
arXiv.2211.09085.
[6] D. Dasgupta, D. Venugopal, and K. D. Gupta, “A Review of Generative AI from Historical Perspectives.”
TechRxiv, Feb. 17, 2023. doi: 10.36227/techrxiv.22097942.v1.
[7] R. Huang et al., “AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head.”
arXiv, Apr. 25, 2023. Accessed: Apr. 26, 2023. [Online]. Available: http://arxiv.org/abs/2304.12995
[8] X. Yang, W. Cheng, L. Petzold, W. Y. Wang, and H. Chen, “DNA-GPT: Divergent N-Gram Analysis for
Training-Free Detection of GPT-Generated Text.” arXiv, May 26, 2023. Accessed: May 30, 2023.
[Online]. Available: http://arxiv.org/abs/2305.17359
[9] Q. Jin, Y. Yang, Q. Chen, and Z. Lu, “GeneGPT: Augmenting Large Language Models with Domain
Tools for Improved Access to Biomedical Information,” ArXiv, p. arXiv:2304.09667v2, Apr. 2023.
[10] A. Yüksel, E. Ulusoy, A. Ünlü, G. Deniz, and T. Doğan, “SELFormer: Molecular Representation Learning
via SELFIES Language Models.” arXiv, Apr. 10, 2023. Accessed: Apr. 25, 2023. [Online]. Available: http://
arxiv.org/abs/2304.04662
[11] T. S. Frisby and C. J. Langmead, “Identifying Promising Sequences For Protein Engineering Using
A Deep Transformer Protein Language Model.” bioRxiv, p. 2023.02.15.528697, Feb. 16, 2023. doi:
10.1101/2023.02.15.528697.
[12] S. Islam et al., “A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks.”
arXiv, Jun. 11, 2023. Accessed: Jun. 21, 2023. [Online]. Available: http://arxiv.org/abs/2306.07303
[13] L. Bariah, Q. Zhao, H. Zou, Y. Tian, F. Bader, and M. Debbah, “Large Language Models for Telecom:
The Next Big Thing?” arXiv, Jun. 16, 2023. Accessed: Jun. 26, 2023. [Online]. Available: http://arxiv.org/
abs/2306.10249
[14] S. Yin et al., “A Survey on Multimodal Large Language Models.” arXiv, Jun. 23, 2023. Accessed: Jun. 27,
2023. [Online]. Available: http://arxiv.org/abs/2306.13549
[15] G. Chen et al., “VideoLLM: Modeling Video Sequence with Large Language Models.” arXiv, May 23,
2023. Accessed: May 25, 2023. [Online]. Available: http://arxiv.org/abs/2305.13292
[16] M. Maaz, H. Rasheed, S. Khan, and F. S. Khan, “Video-ChatGPT: Towards Detailed Video Understanding
via Large Vision and Language Models.” arXiv, Jun. 08, 2023. Accessed: Jun. 14, 2023. [Online].
Available: http://arxiv.org/abs/2306.05424
[17] T. Muhammad et al., “Transformer-Based Deep Learning Model for Stock Price Prediction: A Case
Study on Bangladesh Stock Market,” Int. J. Comp. Intel. Appl., p. 2350013, Apr. 2023, doi: 10.1142/
S146902682350013X.
[18] Z. Bi et al., “Relphormer: Relational Graph Transformer for Knowledge Graph Representations.” arXiv,
Mar. 14, 2023. Accessed: Jul. 06, 2023. [Online]. Available: http://arxiv.org/abs/2205.10852
[19] Y. Gai, L. Zhou, K. Qin, D. Song, and A. Gervais, “Blockchain Large Language Models.” arXiv, Apr. 29,
2023. Accessed: May 02, 2023. [Online]. Available: http://arxiv.org/abs/2304.12749
[20] M. Horton, S. Mehta, A. Farhadi, and M. Rastegari, “Bytes Are All You Need: Transformers Operating
Directly On File Bytes.” arXiv, May 31, 2023. Accessed: Jun. 05, 2023. [Online]. Available: http://arxiv.
org/abs/2306.00238
[21] R. Bommasani et al., “On the Opportunities and Risks of Foundation Models.” arXiv, Jul. 12, 2022.
Accessed: Feb. 28, 2023. [Online]. Available: http://arxiv.org/abs/2108.07258
[22] C. Zhou et al., “A Comprehensive Survey on Pretrained Foundation Models: A History from BERT
to ChatGPT.” arXiv, Feb. 18, 2023. Accessed: Feb. 23, 2023. [Online]. Available: http://arxiv.org/
abs/2302.09419
[23] H. Jiang, “A Latent Space Theory for Emergent Abilities in Large Language Models.” arXiv, Apr. 24,
2023. Accessed: Apr. 25, 2023. [Online]. Available: http://arxiv.org/abs/2304.09960
[24] R. Schaeffer, B. Miranda, and S. Koyejo, “Are Emergent Abilities of Large Language Models a Mirage?”
arXiv, Apr. 28, 2023. Accessed: May 02, 2023. [Online]. Available: http://arxiv.org/abs/2304.15004
[25] K. Ahuja and D. Lopez-Paz, “A Closer Look at In-Context Learning under Distribution Shifts.” arXiv,
May 26, 2023. Accessed: May 30, 2023. [Online]. Available: http://arxiv.org/abs/2305.16704
[26] X. Han, D. Simig, T. Mihaylov, Y. Tsvetkov, A. Celikyilmaz, and T. Wang, “Understanding In-Context
Learning via Supportive Pretraining Data.” arXiv, Jun. 26, 2023. Accessed: Jun. 28, 2023. [Online].
Available: http://arxiv.org/abs/2306.15091
[27] Y. Hou et al., “Large Language Models are Zero-Shot Rankers for Recommender Systems.” arXiv, May
15, 2023. Accessed: May 19, 2023. [Online]. Available: http://arxiv.org/abs/2305.08845
[28] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large Language Models are Zero-Shot
Reasoners.” arXiv, Jan. 29, 2023. Accessed: Feb. 13, 2023. [Online]. Available: http://arxiv.org/
abs/2205.11916
[29] T. Shen, G. Long, X. Geng, C. Tao, T. Zhou, and D. Jiang, “Large Language Models are Strong Zero-
Shot Retriever.” arXiv, Apr. 27, 2023. Accessed: May 02, 2023. [Online]. Available: http://arxiv.org/
abs/2304.14233
[30] Y. Li, Y. Wu, J. Li, and S. Liu, “Prompting Large Language Models for Zero-Shot Domain Adaptation in
Speech Recognition.” arXiv, Jun. 28, 2023. Accessed: Jul. 03, 2023. [Online]. Available: http://arxiv.org/
abs/2306.16007
[31] S. Albanie, L. Momeni, and J. F. Henriques, “Large Language Models are Few-shot Publication
Scoopers.” arXiv, Apr. 02, 2023. Accessed: Apr. 06, 2023. [Online]. Available: http://arxiv.org/
abs/2304.00521
[32] Z. Chen, M. M. Balan, and K. Brown, “Language Models are Few-shot Learners for Prognostic
Prediction.” arXiv, Feb. 26, 2023. Accessed: Mar. 02, 2023. [Online]. Available: http://arxiv.org/
abs/2302.12692
[33] T. B. Brown et al., “Language Models are Few-Shot Learners.” arXiv, Jul. 22, 2020. doi: 10.48550/
arXiv.2005.14165.
[34] S. Diao, P. Wang, Y. Lin, and T. Zhang, “Active Prompting with Chain-of-Thought for Large Language
Models.” arXiv, Feb. 26, 2023. Accessed: Feb. 28, 2023. [Online]. Available: http://arxiv.org/
abs/2302.12246
[35] Y. Fu, L. Ou, M. Chen, Y. Wan, H. Peng, and T. Khot, “Chain-of-Thought Hub: A Continuous Effort to
Measure Large Language Models’ Reasoning Performance.” arXiv, May 26, 2023. Accessed: May 30,
2023. [Online]. Available: http://arxiv.org/abs/2305.17306
[36] Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola, “Multimodal Chain-of-Thought Reasoning
in Language Models.” arXiv, Feb. 08, 2023. Accessed: Feb. 16, 2023. [Online]. Available: http://arxiv.
org/abs/2302.00923
[37] B. Peng et al., “RWKV: Reinventing RNNs for the Transformer Era.” arXiv, May 22, 2023. Accessed: May
24, 2023. [Online]. Available: http://arxiv.org/abs/2305.13048
[38] J. Ding et al., “LongNet: Scaling Transformers to 1,000,000,000 Tokens.” arXiv, Jul. 05, 2023. Accessed:
Jul. 07, 2023. [Online]. Available: http://arxiv.org/abs/2307.02486
[39] A. Bertsch, U. Alon, G. Neubig, and M. R. Gormley, “Unlimiformer: Long-Range Transformers with
Unlimited Length Input.” arXiv, May 02, 2023. Accessed: May 03, 2023. [Online]. Available: http://arxiv.
org/abs/2305.01625
[40] D. Uthus, S. Ontañón, J. Ainslie, and M. Guo, “mLongT5: A Multilingual and Efficient Text-To-Text
Transformer for Longer Sequences.” arXiv, May 18, 2023. Accessed: May 22, 2023. [Online]. Available:
http://arxiv.org/abs/2305.11129
[41] C. Xu et al., “WizardLM: Empowering Large Language Models to Follow Complex Instructions.” arXiv,
Apr. 24, 2023. Accessed: Apr. 28, 2023. [Online]. Available: http://arxiv.org/abs/2304.12244
[42] J. Kaddour, “The MiniPile Challenge for Data-Efficient Language Models.” arXiv, Apr. 17, 2023.
Accessed: Apr. 20, 2023. [Online]. Available: http://arxiv.org/abs/2304.08442
[43] H. S. V. N. S. K. Renduchintala et al., “INGENIOUS: Using Informative Data Subsets for Efficient Pre-
Training of Large Language Models.” arXiv, May 11, 2023. Accessed: May 16, 2023. [Online]. Available:
http://arxiv.org/abs/2305.06677
[44] S. M. Xie et al., “DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining.” arXiv, May
17, 2023. Accessed: May 19, 2023. [Online]. Available: http://arxiv.org/abs/2305.10429
[45] I. R. McKenzie et al., “Inverse Scaling: When Bigger Isn’t Better.” arXiv, Jun. 15, 2023. Accessed: Jun. 21,
2023. [Online]. Available: http://arxiv.org/abs/2306.09479
[46] C.-Y. Hsieh et al., “Distilling Step-by-Step! Outperforming Larger Language Models with Less Training
Data and Smaller Model Sizes.” arXiv, May 03, 2023. doi: 10.48550/arXiv.2305.02301.
[47] R. Eldan and Y. Li, “TinyStories: How Small Can Language Models Be and Still Speak Coherent
English?” arXiv, May 12, 2023. Accessed: May 16, 2023. [Online]. Available: http://arxiv.org/
abs/2305.07759
[48] C. Xu, Y. Xu, S. Wang, Y. Liu, C. Zhu, and J. McAuley, “Small Models are Valuable Plug-ins for Large
Language Models.” arXiv, May 15, 2023. Accessed: May 16, 2023. [Online]. Available: http://arxiv.org/
abs/2305.08848
[49] F. Brahman et al., “PlaSma: Making Small Language Models Better Procedural Knowledge Models for
(Counterfactual) Planning.” arXiv, May 30, 2023. Accessed: Jun. 01, 2023. [Online]. Available: http://
arxiv.org/abs/2305.19472
[50] Z. Guo, P. Wang, Y. Wang, and S. Yu, “Dr. LLaMA: Improving Small Language Models in Domain-
Specic QA via Generative Data Augmentation.” arXiv, May 12, 2023. Accessed: May 16, 2023. [Online].
Available: http://arxiv.org/abs/2305.07804
[51] Y. Zhao, R. Joshi, T. Liu, M. Khalman, M. Saleh, and P. J. Liu, “SLiC-HF: Sequence Likelihood Calibration
with Human Feedback.” arXiv, May 17, 2023. Accessed: May 19, 2023. [Online]. Available: http://arxiv.
org/abs/2305.10425
[52] Z. Yuan, H. Yuan, C. Tan, W. Wang, S. Huang, and F. Huang, “RRHF: Rank Responses to Align Language
Models with Human Feedback without tears.” arXiv, Apr. 11, 2023. Accessed: Apr. 25, 2023. [Online].
Available: http://arxiv.org/abs/2304.05302
[53] C. Zhou et al., “LIMA: Less Is More for Alignment.” arXiv, May 18, 2023. Accessed: May 22, 2023. [Online].
Available: http://arxiv.org/abs/2305.11206
[54] A. Bietti, V. Cabannes, D. Bouchacourt, H. Jegou, and L. Bottou, “Birth of a Transformer: A Memory
Viewpoint.” arXiv, Jun. 01, 2023. Accessed: Jun. 05, 2023. [Online]. Available: http://arxiv.org/
abs/2306.00802
[55] W. Wang et al., “Augmenting Language Models with Long-Term Memory.” arXiv, Jun. 12, 2023.
Accessed: Jun. 14, 2023. [Online]. Available: http://arxiv.org/abs/2306.07174
[56] X. Liang et al., “Unleashing Infinite-Length Input Capacity for Large-scale Language Models with
Self-Controlled Memory System.” arXiv, Apr. 26, 2023. Accessed: Apr. 28, 2023. [Online]. Available:
http://arxiv.org/abs/2304.13343
[57] J. Kang, R. Laroche, X. Yuan, A. Trischler, X. Liu, and J. Fu, “Think Before You Act: Decision Transformers
with Internal Working Memory.” arXiv, May 23, 2023. Accessed: May 30, 2023. [Online]. Available:
http://arxiv.org/abs/2305.16338
[58] W. Zhong, L. Guo, Q. Gao, and Y. Wang, “MemoryBank: Enhancing Large Language Models with
Long-Term Memory.” arXiv, May 18, 2023. Accessed: May 22, 2023. [Online]. Available: http://arxiv.org/
abs/2305.10250
[59] Y. Zeng et al., “What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?” arXiv,
Jul. 05, 2023. Accessed: Jul. 07, 2023. [Online]. Available: http://arxiv.org/abs/2307.02469
[60] W. Berrios, G. Mittal, T. Thrush, D. Kiela, and A. Singh, “Towards Language Models That Can See:
Computer Vision Through the LENS of Natural Language.” arXiv, Jun. 28, 2023. Accessed: Jun. 29,
2023. [Online]. Available: http://arxiv.org/abs/2306.16410
[61] C. Lyu et al., “Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text
Integration.” arXiv, Jun. 15, 2023. Accessed: Jun. 21, 2023. [Online]. Available: http://arxiv.org/
abs/2306.09093
[62] R. Girdhar et al., “ImageBind: One Embedding Space To Bind Them All.” arXiv, May 31, 2023. doi:
10.48550/arXiv.2305.05665.
[63] L. Xue et al., “ULIP-2: Towards Scalable Multimodal Pre-training For 3D Understanding.” arXiv, May 14,
2023. Accessed: May 16, 2023. [Online]. Available: http://arxiv.org/abs/2305.08275
[64] C. Yeh, Y. Chen, A. Wu, C. Chen, F. Viégas, and M. Wattenberg, “AttentionViz: A Global View of
Transformer Attention.” arXiv, May 04, 2023. Accessed: May 08, 2023. [Online]. Available: http://arxiv.
org/abs/2305.03210
[65] J. Copet et al., “Simple and Controllable Music Generation.” arXiv, Jun. 08, 2023. Accessed: Jun. 14,
2023. [Online]. Available: http://arxiv.org/abs/2306.05284
[66] M. S. Ozdayi et al., “Controlling the Extraction of Memorized Data from Large Language Models via
Prompt-Tuning.” arXiv, May 19, 2023. Accessed: May 22, 2023. [Online]. Available: http://arxiv.org/
abs/2305.11759
[67] H. Zhang, M. Dang, N. Peng, and G. V. den Broeck, “Tractable Control for Autoregressive Language
Generation.” arXiv, Apr. 17, 2023. Accessed: Apr. 20, 2023. [Online]. Available: http://arxiv.org/
abs/2304.07438
[68] T. Zhang, Y. Zhang, V. Vineet, N. Joshi, and X. Wang, “Controllable Text-to-Image Generation
with GPT-4.” arXiv, May 29, 2023. Accessed: May 31, 2023. [Online]. Available: http://arxiv.org/
abs/2305.18583
[69] J. Hewitt, J. Thickstun, C. D. Manning, and P. Liang, “Backpack Language Models.” arXiv, May 26, 2023.
Accessed: May 30, 2023. [Online]. Available: http://arxiv.org/abs/2305.16765
[70] H. Zhang, M. Dang, N. Peng, and G. V. den Broeck, “Tractable Control for Autoregressive Language
Generation.” arXiv, Apr. 17, 2023. Accessed: Apr. 20, 2023. [Online]. Available: http://arxiv.org/
abs/2304.07438
[71] S. Chen, S. Gao, and J. He, “Evaluating Factual Consistency of Summaries with Large Language
Models.” arXiv, May 23, 2023. Accessed: May 27, 2023. [Online]. Available: http://arxiv.org/
abs/2305.14069
[72] P. Laban et al., “LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond.” arXiv,
May 23, 2023. Accessed: May 25, 2023. [Online]. Available: http://arxiv.org/abs/2305.14540
[73] S. Zheng, J. Huang, and K. C.-C. Chang, “Why Does ChatGPT Fall Short in Answering Questions
Faithfully?” arXiv, Apr. 20, 2023. Accessed: Apr. 25, 2023. [Online]. Available: http://arxiv.org/
abs/2304.10513
[74] A. Borji, “A Categorical Archive of ChatGPT Failures.” arXiv, Feb. 18, 2023. doi: 10.48550/
arXiv.2302.03494.
[75] Z. Gekhman, J. Herzig, R. Aharoni, C. Elkind, and I. Szpektor, “TrueTeacher: Learning Factual
Consistency Evaluation with Large Language Models.” arXiv, May 18, 2023. Accessed: May 22, 2023.
[Online]. Available: http://arxiv.org/abs/2305.11171
[76] J. Kirchenbauer et al., “On the Reliability of Watermarks for Large Language Models.” arXiv, Jun. 09,
2023. Accessed: Jun. 13, 2023. [Online]. Available: http://arxiv.org/abs/2306.04634
[77] E. Mavroudi, T. Afouras, and L. Torresani, “Learning to Ground Instructional Articles in Videos through
Narrations.” arXiv, Jun. 06, 2023. Accessed: Jun. 07, 2023. [Online]. Available: http://arxiv.org/
abs/2306.03802
[78] Z. Lin, S. Trivedi, and J. Sun, “Generating with Confidence: Uncertainty Quantification for Black-box
Large Language Models.” arXiv, May 30, 2023. Accessed: Jun. 05, 2023. [Online]. Available: http://arxiv.
org/abs/2305.19187
[79] J. Yu et al., “KoLA: Carefully Benchmarking World Knowledge of Large Language Models.” arXiv, Jun.
15, 2023. Accessed: Jun. 21, 2023. [Online]. Available: http://arxiv.org/abs/2306.09296
[80] R. Cohen, M. Hamri, M. Geva, and A. Globerson, “LM vs LM: Detecting Factual Errors via Cross
Examination.” arXiv, May 22, 2023. Accessed: May 24, 2023. [Online]. Available: http://arxiv.org/
abs/2305.13281
[81] T. Zhang et al., “Interpretable Unified Language Checking.” arXiv, Apr. 07, 2023. Accessed: Apr. 12,
2023. [Online]. Available: http://arxiv.org/abs/2304.03728
[82] P. Manakul, A. Liusie, and M. J. F. Gales, “SelfCheckGPT: Zero-Resource Black-Box Hallucination
Detection for Generative Large Language Models.” arXiv, Mar. 15, 2023. Accessed: Mar. 22, 2023.
[Online]. Available: http://arxiv.org/abs/2303.08896
[83] B. Peng et al., “Check Your Facts and Try Again: Improving Large Language Models with External
Knowledge and Automated Feedback.” arXiv, Mar. 01, 2023. Accessed: Mar. 02, 2023. [Online].
Available: http://arxiv.org/abs/2302.12813
[84] Z. Luo et al., “Augmented Large Language Models with Parametric Knowledge Guiding.” arXiv, May
08, 2023. Accessed: May 12, 2023. [Online]. Available: http://arxiv.org/abs/2305.04757
[85] Y. Xi et al., “Towards Open-World Recommendation with Knowledge Augmentation from Large
Language Models.” arXiv, Jun. 19, 2023. Accessed: Jun. 26, 2023. [Online]. Available: http://arxiv.org/
abs/2306.10933
[86] L. Yang, H. Chen, Z. Li, X. Ding, and X. Wu, “ChatGPT is not Enough: Enhancing Large Language Models
with Knowledge Graphs for Fact-aware Language Modeling.” arXiv, Jun. 20, 2023. Accessed: Jun. 26,
2023. [Online]. Available: http://arxiv.org/abs/2306.11489
[87] N. Mündler, J. He, S. Jenko, and M. Vechev, “Self-contradictory Hallucinations of Large Language
Models: Evaluation, Detection and Mitigation.” arXiv, May 25, 2023. Accessed: May 30, 2023. [Online].
Available: http://arxiv.org/abs/2305.15852
[88] T. Gao, H. Yen, J. Yu, and D. Chen, “Enabling Large Language Models to Generate Text with Citations.”
arXiv, May 23, 2023. Accessed: May 27, 2023. [Online]. Available: http://arxiv.org/abs/2305.14627
[89] X. Li et al., “Chain of Knowledge: A Framework for Grounding Large Language Models with
Structured Knowledge Bases.” arXiv, May 22, 2023. Accessed: May 25, 2023. [Online]. Available: http://
arxiv.org/abs/2305.13269
[90] K. Bhatia, A. Narayan, C. De Sa, and C. Ré, “TART: A plug-and-play Transformer module for task-
agnostic reasoning.” arXiv, Jun. 13, 2023. Accessed: Jun. 21, 2023. [Online]. Available: http://arxiv.org/
abs/2306.07536
[91] M. Kwon, H. Hu, V. Myers, S. Karamcheti, A. Dragan, and D. Sadigh, “Toward Grounded Social
Reasoning.” arXiv, Jun. 14, 2023. Accessed: Jun. 21, 2023. [Online]. Available: http://arxiv.org/
abs/2306.08651
[92] A. Piktus, “Online tools help large language models to solve problems through reasoning,” Nature,
May 2023, doi: 10.1038/d41586-023-01411-4.
[93] E. Kıcıman, R. Ness, A. Sharma, and C. Tan, “Causal Reasoning and Large Language Models: Opening
a New Frontier for Causality.” arXiv, Apr. 28, 2023. Accessed: May 04, 2023. [Online]. Available: http://
arxiv.org/abs/2305.00050
[94] G. Poesia, K. Gandhi, E. Zelikman, and N. D. Goodman, “Certified Reasoning with Language Models.”
arXiv, Jun. 06, 2023. Accessed: Jun. 14, 2023. [Online]. Available: http://arxiv.org/abs/2306.04031
[95] O. Yoran, T. Wolfson, B. Bogin, U. Katz, D. Deutch, and J. Berant, “Answering Questions by Meta-
Reasoning over Multiple Chains of Thought.” arXiv, Apr. 25, 2023. Accessed: Apr. 26, 2023. [Online].
Available: http://arxiv.org/abs/2304.13007
[96] P. Lu et al., “Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models.”
arXiv, Apr. 19, 2023. Accessed: Apr. 24, 2023. [Online]. Available: http://arxiv.org/abs/2304.09842
[97] S. Bubeck et al., “Sparks of Artificial General Intelligence: Early experiments with GPT-4.” arXiv, Apr. 13,
2023. Accessed: May 03, 2023. [Online]. Available: http://arxiv.org/abs/2303.12712
[98] S. Yao et al., “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” arXiv, May
17, 2023. Accessed: May 22, 2023. [Online]. Available: http://arxiv.org/abs/2305.10601
[99] J. Lanchantin, S. Toshniwal, J. Weston, A. Szlam, and S. Sukhbaatar, “Learning to Reason and
Memorize with Self-Notes.” arXiv, May 01, 2023. doi: 10.48550/arXiv.2305.00833.
[100] Y. Xie et al., “OlaGPT: Empowering LLMs With Human-like Problem-Solving Abilities.” arXiv, May 23,
2023. Accessed: May 30, 2023. [Online]. Available: http://arxiv.org/abs/2305.16334
[101] H. Zhang et al., “Building Cooperative Embodied Agents Modularly with Large Language Models.”
arXiv, Jul. 05, 2023. Accessed: Jul. 07, 2023. [Online]. Available: http://arxiv.org/abs/2307.02485
[102] D. Gao et al., “AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn.”
arXiv, Jun. 14, 2023. Accessed: Jun. 21, 2023. [Online]. Available: http://arxiv.org/abs/2306.08640
[103] Z. Zhang, X. Zhang, W. Xie, and Y. Lu, “Responsible Task Automation: Empowering Large Language
Models as Responsible Task Automators.” arXiv, Jun. 01, 2023. Accessed: Jun. 05, 2023. [Online].
Available: http://arxiv.org/abs/2306.01242
[104] G. Wang et al., “Voyager: An Open-Ended Embodied Agent with Large Language Models.” arXiv, May
25, 2023. doi: 10.48550/arXiv.2305.16291.
[105] T. Cai, X. Wang, T. Ma, X. Chen, and D. Zhou, “Large Language Models as Tool Makers.” arXiv, May 26,
2023. Accessed: May 30, 2023. [Online]. Available: http://arxiv.org/abs/2305.17126
[106] C. Colas, L. Teodorescu, P.-Y. Oudeyer, X. Yuan, and M.-A. Côté, “Augmenting Autotelic Agents with
Large Language Models.” arXiv, May 21, 2023. Accessed: May 23, 2023. [Online]. Available: http://arxiv.
org/abs/2305.12487
[107] L. Chen, M. Zaharia, and J. Zou, “FrugalGPT: How to Use Large Language Models While Reducing Cost
and Improving Performance.” arXiv, May 09, 2023. Accessed: May 10, 2023. [Online]. Available: http://
arxiv.org/abs/2305.05176
[108] E. Nijkamp, H. Hayashi, C. Xiong, S. Savarese, and Y. Zhou, “CodeGen2: Lessons for Training LLMs
on Programming and Natural Languages.” arXiv, May 03, 2023. Accessed: May 09, 2023. [Online].
Available: http://arxiv.org/abs/2305.02309
[109] Y. Wang, H. Le, A. D. Gotmare, N. D. Q. Bui, J. Li, and S. C. H. Hoi, “CodeT5+: Open Code Large Language
Models for Code Understanding and Generation.” arXiv, May 13, 2023. Accessed: May 16, 2023.
[Online]. Available: http://arxiv.org/abs/2305.07922
[110] Y. Cai et al., “Low-code LLM: Visual Programming over LLMs.” arXiv, Apr. 17, 2023. doi: 10.48550/
arXiv.2304.08103.
[111] J. Yang et al., “Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond.” arXiv,
Apr. 26, 2023. Accessed: Apr. 27, 2023. [Online]. Available: http://arxiv.org/abs/2304.13712
[112] B. Lester, R. Al-Rfou, and N. Constant, “The Power of Scale for Parameter-Efcient Prompt Tuning,” in
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online
and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp.
3045–3059. doi: 10.18653/v1/2021.emnlp-main.243.
[113] X. L. Li and P. Liang, “Prex-Tuning: Optimizing Continuous Prompts for Generation.” arXiv, Jan. 01,
2021. doi: 10.48550/arXiv.2101.00190.
[114] Z. Hu et al., “LLM-Adapters: An Adapter Family for Parameter-Efcient Fine-Tuning of Large
Language Models.” arXiv, Apr. 04, 2023. Accessed: Apr. 11, 2023. [Online]. Available: http://arxiv.org/
abs/2304.01933
[115] E. J. Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv, Oct. 16, 2021. doi:
10.48550/arXiv.2106.09685.
[116] P. Gao et al., “LLaMA-Adapter V2: Parameter-Efcient Visual Instruction Model.” arXiv, Apr. 28, 2023.
Accessed: May 02, 2023. [Online]. Available: http://arxiv.org/abs/2304.15010
[117] R. Zhang et al., “LLaMA-Adapter: Efcient Fine-tuning of Language Models with Zero-init Attention.”
arXiv, Mar. 28, 2023. Accessed: Apr. 03, 2023. [Online]. Available: http://arxiv.org/abs/2303.16199
[118] Y. Chai, J. Gkountouras, G. G. Ko, D. Brooks, and G.-Y. Wei, “INT2.1: Towards Fine-Tunable Quantized
Large Language Models with Error Correction through Low-Rank Adaptation.” arXiv, Jun. 13, 2023. doi:
10.48550/arXiv.2306.08162.
[119] L. Chen, J. Chen, T. Goldstein, H. Huang, and T. Zhou, “InstructZero: Efcient Instruction Optimization
for Black-Box Large Language Models.” arXiv, Jun. 05, 2023. Accessed: Jun. 06, 2023. [Online].
Available: http://arxiv.org/abs/2306.03082
[120] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “QLoRA: Efcient Finetuning of Quantized
LLMs.” arXiv, May 23, 2023. Accessed: May 25, 2023. [Online]. Available: http://arxiv.org/abs/2305.14314
[121] S. Malladi et al., “Fine-Tuning Language Models with Just Forward Passes.” arXiv, May 26, 2023.
Accessed: May 30, 2023. [Online]. Available: http://arxiv.org/abs/2305.17333
[122] Y. Chang et al., “A Survey on Evaluation of Large Language Models.” arXiv, Jul. 06, 2023. Accessed: Jul.
07, 2023. [Online]. Available: http://arxiv.org/abs/2307.03109
[123] D. Kiela et al., “Dynabench: Rethinking Benchmarking in NLP.” arXiv, Apr. 07, 2021. Accessed: Apr. 13,
2023. [Online]. Available: http://arxiv.org/abs/2104.14337
[124] Z. Ma et al., “Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation
Benchmarking.” arXiv, May 20, 2021. doi: 10.48550/arXiv.2106.06052.
appliedAI
44
Authors
Paul Yu-Chun Chang works as an AI Expert
specializing in Large Language Models at
appliedAI Initiative GmbH. He has 10 years
of interdisciplinary research experience
in computational linguistics, cognitive
neuroscience, and AI, and 5 years of industrial
experience in developing AI algorithms in
language modeling and image analytics.
Paul holds a PhD from LMU Munich, where he integrated NLP and machine learning methods to study language cognition in the brain.
Bernhard Pugfelder works as Head of Use
Cases and Applications at the appliedAI
Initiative GmbH. Bernhard has 15 years of
experience in the elds of Data Science,
Natural Language Processing (NLP), as well
as data and AI across different companies
such as BMW Group or Volkswagen Group. He
is renowned for his expertise especially in the
eld of AI in general, NLP and Generative AI in
particular.
Dr. Paul Yu-Chun Chang
AI Expert: Foundation Models -
Large Language Models,
appliedAI Initiative GmbH
Bernhard Pugfelder
Head of Use Cases and Applications,
appliedAI Initiative GmbH
Contributors
Simon-Pierre Genot
Senior Manager AI Strategy,
Inneon Technologies
simon.genot@inneon.com
Simon works as a Senior Manager AI Strategy at Infineon, where he is responsible for AI strategy and use case development. Previously, Simon worked in machine learning research at IBM Research in the USA before transitioning to the strategic side by launching the first AI initiative at BayWa.
Dr. Mark Buckley
Research Scientist,
Siemens AG
Mark holds a PhD in computational linguistics
from Saarland University, where he worked
on machine learning methods for dialogue
systems. He joined Siemens Technology
as a research scientist for industrial NLP in
2015, working on low-resource NLP, domain adaptation, and the interface of structured and unstructured data.
Dr. Philipp Hartmann
Director of AI Strategy,
appliedAI Initiative GmbH
Philipp Hartmann serves as Director of AI Strategy at the appliedAI Initiative GmbH. Prior to joining appliedAI, he spent four years at McKinsey & Company as a strategy consultant. Philipp holds a PhD from the Technical University of Munich, where he investigated factors of competitive advantage in Artificial Intelligence.
Mingyang Ma
Senior AI Strategist,
appliedAI Initiative GmbH
Mingyang Ma works as Senior AI Strategist at the appliedAI Initiative GmbH, supporting all partner companies in decision making and in identifying technical solutions for various AI use cases, with a particular focus on leveraging LLMs. With over 6 years of expertise in NLP, Mingyang has excelled in the realm of Conversational AI, demonstrating her proficiency in application DevOps and platform development during her tenure at BMW Group in both Germany and the USA.
About appliedAI Initiative GmbH
appliedAI is Europe's largest initiative for the
application of trusted AI technology. The initiative was
established in 2017 by Dr. Andreas Liebl as a division
of UnternehmerTUM Munich and transferred to a joint
venture with Innovation Park Artificial Intelligence (IPAI)
Heilbronn in 2022.
At the Munich and Heilbronn offices, more than 100 employees pursue the goal of making European industry a shaper in the AI era, maintaining Europe's competitiveness and actively shaping the future.
appliedAI holistically supports international
corporations, including BMW and Siemens, as well as
medium-sized companies in their AI transformation.
This is accomplished through partnership-
based exchange and joint knowledge building,
comprehensive accelerator programs, and specific
solutions and services.
For more information, please visit
https://www.appliedai.de/en/
[Infographic: appliedAI at a glance. Embedded in global AI networks; represented in OECD & GPAI. Mission: advancing the industry on their AI journey based on holistic frameworks, strengthening and building the next AI champions in Europe, and advancing Europe's industry to compete in the age of AI. AI maturity levels by value created: Not started (AI is not on the company's agenda); Experimenter (impulse to start with AI is established, first prototypes are built); Practitioner (AI vision is known and systematic implementation has started); Professional (AI is in production, in use, and broadly embedded in the organization); Shaper (organizational DNA is transformed). Key figures: 100+ experts, 300+ partners in the ecosystem, 200+ companies supported, 50+ AI applications, 25+ AI strategies, 2.5K+ trained in live sessions, 45K+ trained in online courses. Offerings: Partnership, Solutions & Services, Programmes, Ventures.]
Acknowledgement
The content presented in Chapter 2, “To Make or To Buy: Leveraging Large Language Models in Business,” is based upon the invaluable insights and research findings derived from the publication “AI Insights - Making business decisions in the realm of Large Language Models,” authored by Dr. Philip Hutchinson and Bernhard Pflugfelder and published by the appliedAI Institute for Europe gGmbH. We extend our sincere appreciation to the authors for their work, which has served as a fundamental reference and inspiration. Their expertise and dedication have played a crucial role in shaping the ideas and understanding presented herein.
The content presented in this white paper has been influenced and inspired by discussions and exchanges within the appliedAI Working Group “Large Language Models,” including appliedAI industry partners such as BMW Group, Giesecke+Devrient GmbH, EnBW Energie Baden-Württemberg AG, Infineon Technologies AG, Miele & Cie. KG, Munich Re Group, Rohde & Schwarz GmbH & Co. KG, and Siemens AG. The collective expertise, exchange, and dedication to advancing knowledge in Generative AI were a great inspiration throughout the process of creating this white paper.
A Guide for Large Language
Model Make-or-Buy Strategies:
Business and Technical Insights
appliedAI Initiative GmbH
Freddie Mercury Street 5
80797 Munich
Germany
www.appliedai.de