AI Use Cases: From training to inference, understanding how to deploy AI models
Posted on: 2025-03-15

It's fair to say that AI is all over the news, and if you work in this field like I do, it can be hard to keep track of the new AI models and tools coming out on an almost daily basis. There are a lot of options, from Large Language Models (LLMs) like the ones behind ChatGPT, to specialized vision models, tools that edit photos and videos, text-to-speech systems that can clone voices, image recognition software, and so on. It can also be hard to pinpoint how easy or expensive a specific use case will be to implement.
For example, I regularly hear people claim that because DeepSeek is open source, anyone can self-host the full model on their laptop, which is patently false. In this short primer I will attempt to give some basic information on how to navigate AI models and how to pick the right tool for your use case.
Types of models
First, let's take a look at the various types of models available:
- Large Language Models (LLMs): These are used for text generation. They take an input prompt, convert it into tokens (see the short tokenization sketch after this list), process those tokens through their neural network, then return what they estimate to be the most likely answer to your prompt. This is what you use for creative writing, summarization or coding assistance. Examples include GPT-4 (used by ChatGPT), Claude and Llama 3.
- Vision models: These are designed for image processing tasks such as classification, object detection, segmentation, and image or video generation. Some examples are CLIP, Stable Diffusion and SAM (Segment Anything Model).
- Speech and audio models: These models convert speech to text or synthesize voices. They can also power conversational agents and audio assistants. Some examples include Whisper (speech-to-text) and VALL-E (voice synthesis).
- Multimodal systems: Recently, multimodal models have become the most popular option. These are models that can handle multiple types of data. For example, GPT-4o is a multimodal system that accepts text, images and audio natively, which earlier versions could not.
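To make the "tokens" step mentioned above more concrete, here is a minimal sketch using OpenAI's tiktoken library (my choice for the example; any tokenizer library works along the same lines):

```python
# A minimal sketch: turning a prompt into tokens before the model sees it.
# Assumes `pip install tiktoken`; other tokenizers behave similarly.
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4 era models

prompt = "Summarize the plot of Hamlet in two sentences."
token_ids = encoder.encode(prompt)

print(token_ids)                                  # a list of integer token IDs
print(len(token_ids), "tokens")                   # models are limited and billed by this count
print([encoder.decode([t]) for t in token_ids])   # the text chunk behind each token
```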
On top of the type of model, there are additional capabilities a model may have. For example, the big feature that DeepSeek-R1 helped popularize is reasoning, and new LLMs now commonly ship with some kind of reasoning or "deep thinking" mode. The difference is that the model generates explicit step-by-step reasoning (chain-of-thought) before producing its final answer, instead of answering in a single pass. This makes reasoning models better at logic-heavy tasks like mathematics and science.
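As a rough illustration of what chain-of-thought prompting looks like, here is a tiny sketch; the send() helper is purely a placeholder for whatever LLM client you use, and reasoning models effectively bake this behaviour in:

```python
# A minimal chain-of-thought prompting sketch. send() is a placeholder for
# your LLM client of choice (OpenAI, Ollama, etc.); only the prompts differ.
question = "A train leaves at 9:40 and the trip takes 2h35m. When does it arrive?"

direct_prompt = question

cot_prompt = (
    "Think through the problem step by step, showing your intermediate "
    "reasoning, then state the final answer on its own line.\n\n" + question
)

print(cot_prompt)
# send(direct_prompt)  -> often just an answer, sometimes wrong on multi-step problems
# send(cot_prompt)     -> intermediate steps first, which tends to improve accuracy
```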
Model size considerations
The size of a model is a crucial concept to understand. Models are typically categorized by how many parameters they contain, ranging from small (a few hundred million parameters) to massive (100+ billion parameters). Size determines how much memory and compute the model requires:
- Small models (100M - 1B): These are small, fast models that can run on mobile phones or low-power computers. They can be used for image tagging, text classification or speech-to-text. Some examples include DistilBERT (66M) and Whisper tiny (39M). While they can be very effective at narrow tasks, if you tried to build a general-purpose chatbot with such a model, you would quickly realize how unusable it is.
- Medium models (1B - 10B): These models can handle more complex tasks like powering a simple chatbot, generating recommendations or doing anomaly detection. These are by far the most common self-hosted models, and they can run on a powerful gaming computer. For example, Mistral (7B), Llama 3.2 (3B) and Gemma (7B) can all run on a PC with an 8GB GPU, typically in quantized form.
- Large models (10B - 50B): These are high-quality models that can do serious text generation and image processing. To run them, you need a powerful server with a high-end GPU or multiple TPUs. While they can be within reach of some, they are typically best consumed as cloud services. Some examples include Phi-4 (14B) and Gemma 2 (27B).
- Very large models (50B+): These are the highest-end, latest-generation models that you interact with when you use one of the popular AI apps, running models like GPT-4 (estimated at over 1T parameters) or DeepSeek-R1 (671B). To run the full DeepSeek-R1 model at FP16, you would need about 1.34TB of VRAM for the weights alone, and preferably 2TB+ of system memory. That works out to roughly 32x NVIDIA H100 80GB GPUs.
One additional concept when it comes to model size is quantization. This is a technique that reduces the memory and compute requirements of a model by lowering its numerical precision. Instead of the full 16-bit or 32-bit floating point numbers (FP16 or FP32), quantized models use 8-bit, 4-bit or even lower precision representations. This significantly reduces the amount of VRAM needed to run a model, but it also degrades output quality somewhat. For example, if you go to the Ollama models page and search for DeepSeek, you will see the model available in its full 671B version, but also in much smaller 70B, 32B, 14B, 8B, 7B and even 1.5B versions (strictly speaking, those smaller versions are distilled models based on Llama and Qwen rather than shrunken copies of the 671B model, and each of them is in turn served in quantized form). The lower you go, the fewer resources the model needs, but also the dumber it ends up being.
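To see where figures like the 1.34TB above come from, here is a small back-of-the-envelope sketch: weight memory is roughly the parameter count multiplied by the bytes per parameter, which is exactly the number quantization shrinks. It deliberately ignores KV cache and activation overhead, which is why real deployments budget a good margin on top.

```python
# Back-of-the-envelope weight memory: parameters x bytes per parameter.
# Ignores KV cache / activation overhead, which adds a sizeable margin on top.
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    return params_billion * (bits_per_param / 8)  # billions of params * bytes each = GB

for name, params in [("DeepSeek-R1 671B", 671), ("Llama 3.3 70B", 70), ("Mistral 7B", 7)]:
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: ~{weight_memory_gb(params, bits):,.0f} GB")

# DeepSeek-R1 671B at 16-bit: ~1,342 GB  (the ~1.34TB figure above)
# Llama 3.3 70B at 4-bit:     ~35 GB     (why quantized 70B models become feasible)
# Mistral 7B at 4-bit:        ~4 GB      (fits on an 8GB consumer GPU)
```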
Training, fine-tuning and RAG
There are two distinct phases in the use of AI: training and inference. Training is when you build the model from a huge amount of data, while inference is when you use the trained model to answer questions or make predictions. One of the biggest misconceptions in AI is that every use case requires training a model from scratch. In reality, very few companies do this. There are three main ways to adapt AI to your needs, each with different resource requirements:
- Full training: If you need to train a model from scratch, you're going to need a lot of data and a lot of compute. Most modern models were trained on thousands of GPUs working together for weeks or months, processing billions of documents. This is how base or foundational models like GPT or Llama are created, and it's only needed if you're starting your own AI company or doing fundamental research.
- Fine-tuning: This is the process of taking a base model and refining it with additional training data so it gets better at a specific task. For example, you could take a base model like Llama 3, trained on very general knowledge, and fine-tune it on hundreds of legal documents to make it more attuned to the legal domain. This is a great way to adapt a model to a specific industry, typically requires only hundreds or thousands of documents, and uses far fewer resources than training from scratch.
- Retrieval-Augmented Generation (RAG): Instead of modifying the model through fine-tuning, you add relevant information to the context window, typically fetched from a vector database. This is by far the easiest method and lets you always provide recent information with your prompts. It requires minimal resources and doesn't lock the model to a specific knowledge cutoff. However, the more retrieved data you stuff into the context, the slower and more expensive each request becomes, so there is a practical limit to how much you can feed through this method. It is ideal for use cases that need up-to-date answers, like customer service chatbots (a minimal sketch of the flow follows this list).
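Here is a deliberately simplified sketch of the RAG flow: embed the documents, find the ones most similar to the question, and paste them into the prompt. The embed() function below is only a toy placeholder; a real setup would use a proper embedding model and a vector database such as FAISS, Chroma or pgvector.

```python
# A toy RAG flow: embed documents, retrieve the most similar ones, and
# prepend them to the prompt. embed() is a stand-in for a real embedding model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Placeholder "embedding": a bag-of-words vector. Real systems use a
    # neural embedding model (e.g. an embeddings API or sentence-transformers).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

documents = [
    "Refunds are processed within 5 business days of receiving the returned item.",
    "Our support line is open Monday to Friday, 9am to 5pm CET.",
    "Shipping to Canada takes 7 to 10 business days.",
]
index = [(doc, embed(doc)) for doc in documents]  # the "vector database"

question = "How long do refunds take?"
top_docs = sorted(index, key=lambda d: cosine(embed(question), d[1]), reverse=True)[:2]

prompt = (
    "Answer using only the context below.\n\n"
    "Context:\n" + "\n".join(doc for doc, _ in top_docs) +
    f"\n\nQuestion: {question}"
)
print(prompt)  # this augmented prompt is what gets sent to the LLM
```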
When it comes to images, you may have heard terms like model checkpoints and LoRA (Low-Rank Adaptation), which relate to the same concepts. A checkpoint is a saved snapshot of a model's weights taken during or at the end of training; it allows training to be paused and resumed, and in the image-generation world the word is also used for the full model weight file you download. LoRA is a lightweight method of fine-tuning an existing model that trains only small additional low-rank weight matrices instead of the whole network. For example, you may have a model trained on a large quantity of generic images, but if you need one that is especially good at identifying dog breeds, you can create a LoRA by feeding it hundreds of pictures of dogs with captions specifying the breed.
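As a rough illustration of what setting up a LoRA looks like in code, here is a minimal sketch using the Hugging Face transformers and peft libraries on a small language model. The model name and hyperparameters are just examples, and the actual training loop and dataset are omitted; for images you would do the equivalent against a diffusion model and an image dataset.

```python
# A minimal LoRA setup sketch with Hugging Face transformers + peft.
# Assumes network access to download a small base model; training is omitted.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "Qwen/Qwen2.5-0.5B"  # any small causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                 # rank of the low-rank update matrices
    lora_alpha=16,       # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model

# From here you would run a normal fine-tuning loop (e.g. transformers.Trainer)
# on your domain-specific data, then save just the small LoRA adapter weights.
```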
Which tools to use
Now that we've covered the basic concepts, we can get to the final decision of which tools to use for your specific use case. There are many options, both cloud services and self-hosted. Some applications are easy to use and work out of the box, like ChatGPT for general questions or GitHub Copilot as a coding assistant. Then there are services that expose an API so you can embed them in your own applications, such as the OpenAI API. There are also cloud services that let you fine-tune base models directly, like Amazon Bedrock. Finally, you may instead want to host your own model, either an open source base model, or one you trained or fine-tuned yourself. Below is a quick example of the API route, followed by a table of common use cases that can be improved with AI, along with suggested model types and tools.
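This sketch assumes the official openai Python package and an API key in the OPENAI_API_KEY environment variable; the model name is only an example, and other providers expose very similar chat-style APIs:

```python
# A minimal sketch of embedding an LLM in your own application via the OpenAI API.
# Assumes `pip install openai` and an API key in the OPENAI_API_KEY env variable.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name; pick whatever fits your needs and budget
    messages=[
        {"role": "system", "content": "You are a concise customer-support assistant."},
        {"role": "user", "content": "Where is my order #1234?"},
    ],
)

print(response.choices[0].message.content)
```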
| Field | Use case | Model types | Sample tools |
|---|---|---|---|
| Healthcare | Image analysis (X-rays, MRIs) | Vision models using CNNs or transformers | OpenCV, SimpleITK, IBM Watson Health |
| | Drug discovery and protein folding | Transformers | Google DeepMind AlphaFold, OpenFold, Azure Drug Discovery |
| | Virtual health assistants | Chatbots (LLMs) | Rasa, IBM Watson, Google Dialogflow |
| Finance | Fraud detection | Anomaly detection models | Amazon Fraud Detector, H2O.ai, scikit-learn |
| | Algorithmic trading | Reinforcement learning | Stable Baselines3, TensorTrade |
| | Risk assessment | Predictive models | XGBoost, LightGBM |
| Manufacturing | Predictive maintenance | Time series models (LSTMs) | Azure AI for Predictive Maintenance, TensorFlow, Prophet |
| | Quality control | Vision models (transformers) | Amazon Rekognition, OpenCV, Google Vision AI |
| Transportation | Autonomous vehicles | Multimodal models | Amazon Bedrock, TensorFlow, Autoware, Apollo Auto |
| | Traffic optimization | Reinforcement learning | Azure Percept, OpenAI Gym, TensorFlow |
| Education | Personalized learning | Recommendation engines | Google Education, Amazon Personalize, Surprise |
| | Automated grading | NLP (transformers) | Google Natural Language, Amazon Comprehend, spaCy |
| Retail | Product recommendations | Collaborative filtering, transformers | Amazon Personalize, PredictionIO, RecBole |
| | Customer service | Chatbots (LLMs) | Google Dialogflow, IBM Watson Assistant, OpenAI API, Rasa |
| Cybersecurity | Threat detection | Anomaly detection models | Amazon GuardDuty, Microsoft Defender, H2O.ai |
| | Automated security response | Reinforcement learning | Microsoft Sentinel, AWS Security Hub, TensorFlow |
Of course, each use case is different and you should evaluate tools based on your specific needs, but hopefully this primer gave you a good basis for making that decision.