AI
“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of light, it was the season of darkness, it was the spring of hope, it was the winter of despair.”
— Charles Dickens, A Tale of Two Cities
Tutorials
A Guide to Quantization in LLMs
Local inference
Hardware requirements
- Both CPU and GPU inference are supported.
- Have at least 8 GB of RAM available to run the 7B models, 16 GB for the 13B models, and 32 GB for the 33B models (see the sizing sketch below).
- On Apple M1 and later, Ollama performs well with 7B models thanks to the unified memory architecture.
- High memory bandwidth (typically a GPU) is the top priority for good inference performance.
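As a rough cross-check on those RAM figures: a model's weight footprint is approximately parameter count × bytes per weight, plus overhead for the KV cache and runtime buffers. A minimal sketch; the quantization byte widths and the 20% overhead factor are approximations, not exact numbers:

```python
# Back-of-the-envelope memory estimate: weights ~ params * bytes/weight,
# plus an assumed ~20% overhead for KV cache and runtime buffers.
BYTES_PER_WEIGHT = {"fp16": 2.0, "q8_0": 1.0, "q4_0": 0.5}

def approx_memory_gb(params_billion: float, quant: str = "q4_0") -> float:
    weights_gb = params_billion * BYTES_PER_WEIGHT[quant]
    return weights_gb * 1.2  # assumed overhead factor

for size_b in (7, 13, 33):
    print(f"{size_b}B @ q4_0 = approx. {approx_memory_gb(size_b):.1f} GB")
# approx. 4.2, 7.8, 19.8 GB: comfortably inside the 8/16/32 GB guidance above
```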
HW performance benchmarks:
- Performance of llama.cpp on Apple Silicon M-series
- Multiple NVIDIA GPUs or Apple Silicon for Large Language Model Inference
- Comparing Throughput Performance of Running Local LLMs and VLM on different systems
Which model to choose?
-> See models
AI LLM Frameworks
fabric is an open-source framework for augmenting humans using AI. It provides a modular framework for solving specific problems using a crowdsourced set of AI prompts that can be used anywhere.
https://github.com/danielmiessler/fabric
LM Studio
Discover, download, and run local LLMs. Run Llama 3, Phi 3, Falcon, Mistral, StarCoder, Gemma, and other GGUF models from Hugging Face.
Text-to-Speech
Zero-Shot Speech Editing and Text-to-Speech in the Wild
https://github.com/jasonppy/VoiceCraft
A deep learning toolkit for Text-to-Speech, battle-tested in research and production
https://github.com/coqui-ai/TTS
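For a quick taste of the Coqui toolkit, a minimal sketch; the model name is just one of the library's published English models, treat it as an assumption:

```python
# Minimal Coqui TTS sketch: synthesize one sentence to a WAV file.
# Assumes `pip install TTS`; the model name below is an assumption,
# other published models can be listed via TTS().list_models().
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Local text-to-speech with Coqui.", file_path="output.wav")
```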
OpenAI parameters
- Temperature: Controls randomness. Lowering the temperature means that the model produces more repetitive and deterministic responses. Increasing the temperature results in more unexpected or creative responses. Try adjusting temperature or Top P but not both.
- Max length (tokens): Set a limit on the number of tokens per model response. The API supports a maximum of 4000 tokens shared between the prompt (including system message, examples, message history, and user query) and the model’s response. One token is roughly four characters for typical English text.
- Stop sequences: Make responses stop at a desired point, such as the end of a sentence or list. Specify up to four sequences where the model will stop generating further tokens in a response. The returned text won’t contain the stop sequence.
- Top probabilities (Top P): Similar to temperature, this controls randomness but uses a different method. Lowering Top P narrows the model’s token selection to likelier tokens. Increasing Top P lets the model choose from tokens with both high and low likelihood. Try adjusting temperature or Top P but not both.
- Frequency penalty: Reduce the chance of repeating a token proportionally based on how often it has appeared in the text so far. This decreases the likelihood of repeating the exact same text in a response.
- Presence penalty: Reduce the chance of repeating any token that has appeared in the text at all so far. This increases the likelihood of introducing new topics in a response.
- Pre-response text: Insert text after the user’s input and before the model’s response. This can help prepare the model for a response.
- Post-response text: Insert text after the model’s generated response to encourage further user input, as when modeling a conversation.
Source: https://learn.microsoft.com/en-us/training/modules/get-started-openai/7-use-azure-openai-playground
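A hedged sketch of how these playground settings map onto the Chat Completions call in the openai Python SDK; the model name and prompts are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",    # placeholder model name
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "List three uses of quantization."},
    ],
    temperature=0.2,        # controls randomness; tune this OR top_p, not both
    # top_p=0.9,            # nucleus-sampling alternative to temperature
    max_tokens=256,         # cap on the response length in tokens
    stop=["\n\n"],          # up to four stop sequences; not returned in the text
    frequency_penalty=0.5,  # penalize tokens by how often they already appeared
    presence_penalty=0.0,   # penalize any token that has appeared at all
)
print(response.choices[0].message.content)
```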
Ollama parameters: https://github.com/ollama/ollama/blob/main/docs/modelfile.md#valid-parameters-and-values
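The same Modelfile parameter names can also be passed per request through the `options` field of Ollama's local REST API. A minimal sketch, assuming an Ollama server on the default port and an already-pulled llama3 model:

```python
import json
import urllib.request

payload = {
    "model": "llama3",       # assumption: model already pulled locally
    "prompt": "Explain unified memory in one sentence.",
    "stream": False,
    "options": {             # same names as Modelfile PARAMETER entries
        "temperature": 0.7,
        "num_ctx": 4096,     # context window size
        "num_predict": 128,  # cap on generated tokens
    },
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```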
Development
JSON data output
- Instructor - structured outputs for LLMs
- Outlines - Structured Text Generation
- awesome-llm-json - Resource list for generating JSON using LLMs via function calling, tools, CFG. Libraries, Models, Notebooks, etc.
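As an illustration of the structured-output idea from the list above, a minimal Instructor sketch: a Pydantic model defines the schema and the patched client validates the response against it (the model name is a placeholder):

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class UserInfo(BaseModel):
    name: str
    age: int

# Patch the OpenAI client so responses are parsed and validated
# against the Pydantic schema instead of returned as raw text.
client = instructor.from_openai(OpenAI())

user = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    response_model=UserInfo,
    messages=[{"role": "user", "content": "John Doe is 30 years old."}],
)
print(user.name, user.age)  # typed fields validated by Pydantic
```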
RAG - Retrieval Augmented Generation
Turn your enterprise data into production-ready LLM applications. LlamaIndex is the leading data framework for building LLM applications.
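The core RAG loop in LlamaIndex fits in a few lines: load documents, embed them into an index, then answer queries against the retrieved context. A minimal sketch, assuming OPENAI_API_KEY is set and documents live in a local ./data folder:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # assumed ./data folder
index = VectorStoreIndex.from_documents(documents)     # embed + store chunks
query_engine = index.as_query_engine()                 # retrieval + synthesis
print(query_engine.query("What does the report conclude?"))
```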