The Rapidly Evolving Landscape of Large Language Models

Torsten Reidt, a member of our CactAI team, delves into the intricacies of the most prominent large language models (LLMs), highlighting their pros and cons, complexity, costs, and deployment options for comprehensive understanding and analysis.

Large Language Models (LLMs) are transforming the business landscape, automating customer service, and unlocking predictive insights from vast datasets. As pioneers like Meta, OpenAI, and Google push the boundaries of AI innovation, the LLM landscape is evolving at an unprecedented pace. With breakthroughs and announcements emerging daily, it’s essential to stay ahead of the curve.

In this post, we take a closer look at three leading LLMs – Meta's recently released Llama3 70B, OpenAI's GPT-4, and Mistral AI's Mixtral 8x22B – comparing their key facts, capabilities, and limitations with a view to winning new business, enhancing customer engagement, and transforming company operations.

Before we give a short description of each LLM under consideration, let's take a moment to familiarize ourselves with some key terms in the area of LLMs:

  • Open source: free access to, modification, and distribution of the code/model, with collaborative development.
  • Closed source: proprietary code/model; access is restricted and modifications can only be made by the owner.
  • Context window: the maximum amount of input text (measured in tokens) that a model can process and consider when generating a response or making predictions.
  • Parameters: the learnable variables (weights) of an LLM. In general, the parameter count defines the size of the LLM; as of now, the more parameters a model has, the more capacity it has to learn and represent complex patterns in the data.
  • Function calling: the ability of an LLM to detect that external tools (API calls, custom functions, …) are needed to fulfil a given task and ultimately to call those tools.
  • Open weight: releasing only the pre-trained parameters (weights) of the LLM itself. This allows others to use the model for inference and fine-tuning; however, the training code, original dataset, model architecture details, and training methodology are not provided.
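To make the function-calling idea more concrete, here is a minimal, provider-agnostic sketch. The tool registry, the JSON message format, and the `get_weather` function are our own illustration, not any specific vendor's API:

```python
import json

def get_weather(city: str) -> str:
    """Stand-in for a real external API call."""
    return f"Sunny in {city}"

# Registry of tools the model is allowed to request.
TOOLS = {"get_weather": get_weather}

def handle_model_output(model_output: str) -> str:
    """Dispatch a tool call emitted by the model, if any.

    With function calling, the model does not run code itself: it returns
    a structured request (name + arguments), the application executes the
    matching function, and the result can be fed back to the model.
    """
    message = json.loads(model_output)
    if message.get("type") == "tool_call":
        fn = TOOLS[message["name"]]
        return fn(**message["arguments"])
    return message["content"]

# Simulated model response requesting an external tool:
output = handle_model_output(
    '{"type": "tool_call", "name": "get_weather", "arguments": {"city": "Berlin"}}'
)
print(output)  # Sunny in Berlin
```

Real providers wrap this exchange in their own schemas, but the control flow – model emits a structured request, application executes it – is the same.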

With that out of the way, here is a short description of the contenders:

OpenAI GPT-4 (March 14, 2023)

GPT-4 is not the latest model available from OpenAI, but it has been chosen for this comparison due to the availability of data. The model can be accessed via the OpenAI API. There is no official information about the model architecture or parameter count, but unofficial sources suggest a Mixture-of-Experts architecture (a combination of multiple specialized LLMs) with a total of 1.8T parameters. GPT-4 is optimized for use in English but can ingest text in other languages and respond accordingly. It features a context window of 64k tokens.

Mistral AI Mixtral 8x22B

Mixtral 8x22B is the latest open-source model from Mistral AI. It is a sparse Mixture-of-Experts model with a total of 141B parameters, of which only a fraction is active per token. It is fluent in English, French, Italian, German, and Spanish and has a context window of 64k tokens.

Meta Llama3 70B

As already mentioned, this is one of the latest models of the Meta Llama family. It has a context window of 8k tokens and 70B parameters. About 5% of the training data consisted of non-English text covering over 30 languages, so it has certain multilingual capabilities.

|                  | GPT-4                                    | Mixtral 8x22B                             | Llama3 70B                               |
|------------------|------------------------------------------|-------------------------------------------|------------------------------------------|
| Open source      | no                                       | yes                                       | yes                                      |
| Parameter count  | 1.8T (unofficial)                        | 141B                                      | 70B                                      |
| Context window   | 64k tokens                               | 64k tokens                                | 8k tokens                                |
| Language support | English, capabilities in other languages | English, French, Italian, German, Spanish | English, capabilities in other languages |
| Function calling | yes                                      | yes                                       | yes                                      |
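As a small illustration of what the context window means in practice, here is a rough pre-check of whether a prompt fits a given window. The ~4 characters/token figure is a common rule of thumb for English text, not an exact count; real tokenizers give precise numbers:

```python
def fits_context_window(text: str, context_tokens: int,
                        chars_per_token: float = 4.0) -> bool:
    """Rough check whether a prompt fits a model's context window.

    Uses the common ~4 characters/token heuristic for English text.
    """
    estimated_tokens = len(text) / chars_per_token
    return estimated_tokens <= context_tokens

doc = "word " * 40_000  # ~200,000 characters, roughly 50k tokens
print(fits_context_window(doc, 64_000))  # True:  fits a 64k window
print(fits_context_window(doc, 8_000))   # False: exceeds an 8k window
```

A document that comfortably fits Mixtral 8x22B's 64k window can therefore exceed Llama3 70B's 8k window, forcing chunking or summarization strategies.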

How are Large Language Models evaluated?

Evaluating Large Language Models (LLMs) objectively can be a complex task, as it requires assessing their performance across various aspects. However, there are some common benchmarks which are often used to rank the models relative to each other.

  • MMLU (Massive Multitask Language Understanding) assesses the multitask accuracy of the models.
  • AGIEval assesses the models in the context of human-centric standardized exams, such as math exams.
  • BIG-bench (Beyond the Imitation Game Benchmark) focuses on tasks that are believed to be beyond the capabilities of current language models.
  • ARC-Challenge (AI2 Reasoning Challenge) assesses reasoning on grade-school-level science questions.
  • DROP (Discrete Reasoning Over Paragraphs) assesses reading comprehension capabilities.
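To illustrate how such multiple-choice benchmarks are typically scored, here is a toy sketch. The questions and the stand-in "model" are made up; real evaluation harnesses add few-shot prompting and answer extraction on top, but the reported number is plain accuracy:

```python
def toy_model(question: str, choices: dict) -> str:
    """Stand-in for an LLM: always answers 'A'."""
    return "A"

# Two made-up items in an MMLU-like format: question, labeled choices,
# and the gold answer letter.
benchmark = [
    {"question": "2 + 2 = ?", "choices": {"A": "4", "B": "5"}, "answer": "A"},
    {"question": "Capital of France?", "choices": {"A": "Rome", "B": "Paris"}, "answer": "B"},
]

def accuracy(model, items) -> float:
    """Fraction of items where the model's choice matches the gold answer."""
    correct = sum(model(it["question"], it["choices"]) == it["answer"] for it in items)
    return correct / len(items)

print(accuracy(toy_model, benchmark))  # 0.5
```

The "5-shot" and "25-shot" labels in the table below refer to how many solved examples are prepended to each question; the scoring itself stays this simple.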

Here is how the three models performed on the mentioned benchmarks:

| Benchmark                    | GPT-4 | Mixtral 8x22B | Llama3 70B |
|------------------------------|-------|---------------|------------|
| MMLU (5-shot)                | 86.4  | 77.7          | 79.5       |
| AGIEval English (3-5-shot)   | n/a   | 61.2          | 63.0       |
| BIG-Bench Hard (3-shot, CoT) | 83.1  | 79.2          | 81.3       |
| ARC-Challenge (25-shot)      | 96.3  | 90.7          | 93.0       |
| DROP (3-shot, F1)            | 80.9  | 77.6          | 79.7       |

So now that the numbers are clear, should we simply choose the model that performs best on the majority of benchmarks? The answer is not that simple.

The use case is decisive

While a high score on a benchmark can indicate how well a model generalizes to unseen data for a given task, your use case is the most important consideration when selecting a model. Consider, for example, the following two situations:

  • The task you want the model to perform well on differs from the benchmark task.
  • Your data differs from the data used in the benchmark, for example if you work in a language other than English.

In both cases, there might be other models available which could perform better for your problem setting. In addition to the performance-related criteria, other aspects are worth considering when choosing the LLM. Let’s revisit the three LLMs selected for this article, along with a practical example for each.

GPT-4:

  • Business case: e-commerce platform
  • Use case: A highly conversational and engaging customer chatbot in the English language.

Mixtral 8x22B:

  • Business case: Online translation platform that needs to support various languages, including some lesser-resourced languages.
  • Use case: Backbone for translation due to multilingual capabilities. Can be fine-tuned due to its open-source nature.

Llama3 70B:

  • Business case: Large-scale text classification system that needs to process millions of documents daily
  • Use case: Large-scale classification, leveraging Llama3 70B's efficient architecture and optimized performance combined with its cost-effectiveness. Can be fine-tuned if needed.

Licensing and Cost (Inference)

GPT-4 is a closed-source LLM, and use via the OpenAI API is bound to a fixed price per 1M tokens. Since Llama3 70B and Mixtral 8x22B are open-source models, there is no fixed cost per token; the cost depends instead on how the models are deployed. For comparison, deployment options priced per 1M tokens have been selected.

|                    | GPT-4 | Mixtral 8x22B | Llama3 70B |
|--------------------|-------|---------------|------------|
| Input (1M tokens)  | $30   | $2            | $1         |
| Output (1M tokens) | $60   | $6            | $1         |
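The per-token prices above make cost comparisons straightforward. The following sketch estimates a monthly bill from assumed traffic figures; the token volumes are illustrative, not taken from a real workload:

```python
# (input $, output $) per 1M tokens, from the table above.
PRICES_PER_1M = {
    "GPT-4": (30.0, 60.0),
    "Mixtral 8x22B": (2.0, 6.0),
    "Llama3 70B": (1.0, 1.0),
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Inference cost for a given monthly token volume."""
    price_in, price_out = PRICES_PER_1M[model]
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# Example: 50M input and 10M output tokens per month.
for name in PRICES_PER_1M:
    print(f"{name}: ${monthly_cost(name, 50_000_000, 10_000_000):,.2f}")
# GPT-4: $2,100.00 / Mixtral 8x22B: $160.00 / Llama3 70B: $60.00
```

At this volume the price gap is more than an order of magnitude, which is why benchmark scores alone rarely settle the model choice.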

Deployment

The choice between cloud and on-premises deployment for LLMs should be guided by the specific needs and capabilities of the organization, balancing factors like cost, control, scalability, and security. Each deployment option comes with its distinct advantages and challenges. This section applies only to Mixtral 8x22B and Llama3 70B, as OpenAI's GPT-4 is a closed-source model and can only be used via the API.

Cloud based platforms

Deploying LLMs in the cloud involves utilizing the computational power and resources of a cloud service provider. This approach offers scalability, as businesses can easily adjust their usage based on demand without the need for upfront investment in physical hardware. Cloud deployment also ensures that updates and maintenance are managed by the provider, reducing the IT burden on the company. However, this model depends heavily on internet connectivity and can raise concerns regarding data security and privacy, as sensitive information is processed and stored off-site.

An LLM can be trained, deployed, or hosted on several available platforms, such as:

  • Amazon SageMaker
  • Google Cloud AI Platform
  • Microsoft Azure Machine Learning

The choice of which one to pick depends, among other things, on existing infrastructure or tools in your company. If you already use Amazon for other applications, maybe you don’t want to add a different provider for the deployment of the LLM. Other points to consider would be preferred frameworks or specific needs.

The cost of deployment depends on many factors, such as availability requirements or data volume. The driving cost factor, however, is the size of the chosen LLM, which ultimately determines the hardware (GPUs) needed.

A rough estimate for inference with the Meta Llama3 model, quantized to 4 bits (i.e. reduced in size, with some loss of quality), is about $5 per hour on the Amazon SageMaker "ml.g4dn.12xlarge" instance. This instance provides 64GB of GPU memory (4 NVIDIA T4 GPUs) and can be used for inference. For fine-tuning or training of the LLM, a more powerful instance should be used.
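A quick sanity check for why a 4-bit Llama3 70B fits on such an instance: the weights alone take roughly params × bits/8 bytes. The 20% overhead factor below is a crude allowance for activations and the KV cache, not a precise sizing rule:

```python
def weight_memory_gb(params_billion: float, bits: int,
                     overhead: float = 1.2) -> float:
    """Rough GPU memory needed to serve a model at a given quantization.

    params * bits/8 bytes for the weights, plus ~20% headroom for
    activations and the KV cache (a rule of thumb, not a sizing formula).
    """
    weight_bytes = params_billion * 1e9 * bits / 8
    return weight_bytes * overhead / 1e9

print(round(weight_memory_gb(70, 4), 1))   # ~42 GB: 4-bit Llama3 70B
print(round(weight_memory_gb(70, 16), 1))  # ~168 GB: same model unquantized
```

The same formula shows why the unquantized 16-bit model would need multiple large GPUs, while the 4-bit variant fits on a single mid-range instance.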

On-premises deployment

On-premises deployment involves setting up the LLM infrastructure within a company’s local environment. This approach gives organizations full control over their data, enhancing security and compliance with regulations, particularly critical for industries like healthcare and finance. On-premises solutions also allow for customization that might be necessary for specific organizational needs. The drawbacks include higher initial costs for hardware and infrastructure, as well as the need for ongoing maintenance and technical support, which can be resource-intensive.

The cost of a typical deep learning workstation starts at around €7,000. Such a workstation is often equipped with two consumer-grade GPUs, although the exact configuration depends on the actual requirements and the purpose of the deployment (training or inference?). It is essential to also consider the software and overall configuration, as well as ongoing maintenance and upgrade needs, to ensure optimal performance.
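For a back-of-the-envelope comparison with the cloud option above, one can compute the break-even point between a one-off workstation purchase and an hourly cloud rate. This treats euros and dollars as roughly comparable and deliberately ignores power, maintenance, and staff costs:

```python
def breakeven_hours(workstation_cost: float, cloud_rate_per_hour: float) -> float:
    """Hours of cloud inference after which the workstation pays for itself.

    Simplified: ignores power, maintenance, and staff costs on the
    on-premises side, and currency differences.
    """
    return workstation_cost / cloud_rate_per_hour

# €7,000 workstation vs. ~$5/hour cloud instance:
hours = breakeven_hours(7000, 5)
print(hours)               # 1400.0 hours
print(round(hours / 24))   # ~58 days of continuous 24/7 use
```

In other words, a workload that runs around the clock reaches break-even within about two months, while an occasional workload may never justify the upfront purchase.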

Data Privacy and Security

Both data privacy (the rights and governance surrounding personal data) and data security (the measures and technologies used to protect data from unauthorized access, breaches, and theft) are foundational to building trust in technological systems. They require ongoing attention and adaptation to evolving threats and regulatory landscapes. Ensuring that both are prioritized is essential for safeguarding the rights and interests of all stakeholders in the digital ecosystem. A deployed LLM should be treated like any other deployed application with respect to unauthorized access, data breaches, and cyber threats. In addition, data privacy has to be looked at closely: some providers use user input for training purposes, which could lead to unwanted data leakage.
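One practical mitigation when sending data to a hosted LLM is to mask obvious personal data before it leaves your systems. The sketch below uses deliberately simple regular expressions for e-mail addresses and phone-number-like digit runs; a production setup would rely on a dedicated PII-detection tool:

```python
import re

# Intentionally simple patterns for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d ()/-]{7,}\d")

def redact(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(redact("Contact jane.doe@example.com or +49 170 1234567."))
# Contact [EMAIL] or [PHONE].
```

Applied as a pre-processing step in front of any third-party API, this kind of masking reduces the risk that personal data ends up in a provider's training corpus.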

Conclusion

The landscape of Large Language Models (LLMs) is rapidly evolving, with new functionalities emerging almost daily. Each model comes with its unique strengths and weaknesses. We have compared three prominent LLMs to illustrate key considerations for leveraging their powerful capabilities in driving digital transformation, enhancing customer experiences, and uncovering hidden insights. As the AI landscape advances, it’s evident that those who effectively utilize LLMs will gain a competitive edge.

At Cactus, our dedicated CactAI team is enthusiastic about exploring the optimal AI solutions tailored to your unique business needs, partnering with you to identify the most effective Large Language Model that aligns with your specific use case, thereby accelerating your business growth and enhancing your operational efficiency. Let us help you harness the full potential of AI to drive innovation and achieve competitive advantages in your industry.
