HomeFoundation ModelsNatural Language
Natural Language
Large language models that generate, classify, summarize, and reason over text, accessed from Python via provider SDKs or the transformers library.
Natural language foundation models are large language models (LLMs) trained on broad text corpora. They generate coherent prose, summarize documents, classify content, answer questions, and apply multi-step reasoning across open-ended tasks. Many current models also accept image input alongside text, making them multimodal by default. They serve as the base layer for a wide range of downstream applications, from chatbots to document pipelines to code assistants.
Proprietary models are accessed remotely via provider SDKs: openai for GPT-4, anthropic for Claude 3.5. Each SDK exposes a messages-style API that accepts a list of turns and returns a completion. Open-weight models like Llama 3 are distributed as downloadable checkpoints and run locally through the transformers library, which provides a pipeline abstraction and lower-level AutoModelForCausalLM classes for fine-grained control.
The three models here represent both sides of that split. GPT-4 and Claude 3.5 are proprietary, hosted, and reached over HTTP; Llama 3 is fully open-weight, self-hostable, and fine-tunable on your own data and hardware.
OpenAI's proprietary multimodal LLM, widely used for general reasoning, coding, and tool-use tasks.
Why we picked it
GPT-4 is the most referenced proprietary LLM baseline in the Python ecosystem. It is accessed via the `openai` SDK and represents the API-first, closed-weight end of the natural language category. Its multimodal input support, broad tool-use capabilities, and extensive documentation make it the model most teams encounter first when building LLM-backed applications.
Also evaluated
- GPT-4oFaster, lower-cost OpenAI multimodal model; often the practical default over GPT-4 now.
- Gemini 1.5 ProGoogle's multimodal LLM with a very large context window; accessed via the google-genai SDK.
- Mistral LargeProprietary LLM from Mistral AI; strong on multilingual and code tasks.
Anthropic's proprietary multimodal LLM, notable for long-context handling, reasoning, and coding tasks.
Why we picked it
Claude 3.5 represents a distinct proprietary option alongside GPT-4, accessed via the `anthropic` SDK. It is included here because of its differentiated profile: a large context window, consistent behavior on extended documents, and strong coding performance. For teams evaluating provider options, it is the primary alternative to OpenAI's models in the same API-only tier.
Also evaluated
- Claude 3 OpusAnthropic's higher-capability, higher-cost model in the same family.
- Gemini 1.5 FlashGoogle's faster, cheaper multimodal model for high-throughput use cases.
- Command R+Cohere's LLM optimized for retrieval-augmented generation and enterprise search.
Meta's open-weight LLM family, the common base for fine-tuning, self-hosted deployments, and on-premises inference.
Why we picked it
Llama 3 is the most widely used open-weight LLM family in the Python ecosystem and the standard starting point for teams that need to self-host, fine-tune, or run models without a provider dependency. Weights are loadable via `transformers` and compatible with quantization tools like llama.cpp and bitsandbytes. It represents the open-weight tier of this category, a distinct and necessary position alongside the proprietary API models.
Also evaluated
- Mistral 7B / MixtralMistral AI's open-weight models; competitive on quality per compute, strong for European deployments.
- Gemma 2Google's open-weight LLM family; smaller footprint, permissive license.
- FalconTII's open-weight LLM; earlier in the open-LLM generation, less widely fine-tuned than Llama 3.
What I learned
GPT-4 remains a strong general-purpose choice for tasks that require structured output, function calling, or vision alongside text. The openai SDK's response_format parameter and tools interface make it straightforward to integrate into pipelines that need reliable JSON or multi-step tool use. Latency and cost are higher than smaller models, so it fits best where output quality matters more than throughput.
Claude 3.5 handles long contexts well and produces prose that tends to be more directly usable without post-processing. The anthropic SDK's messages API is close in shape to OpenAI's, so switching between the two is low-friction. For tasks involving lengthy documents, nuanced instruction-following, or drafting, Claude 3.5 is worth evaluating alongside GPT-4 rather than treating either as a default.
Llama 3 changes the tradeoff entirely. Running locally eliminates per-token cost and keeps data off external servers, which matters for compliance-sensitive workloads or high-volume batch jobs. Fine-tuning on domain-specific data is possible with standard tools (Hugging Face trl, peft). The practical constraint is hardware: useful inference on the larger variants requires a capable GPU, and setup overhead is higher than a single pip install openai.
For rapid prototyping or tasks with unclear requirements, the proprietary models are faster to iterate with. For production workloads with volume, data-residency constraints, or a need to customize the base model, Llama 3 becomes the more practical path.
The choice between these models depends on data sensitivity, volume, and how much control you need over the model itself. Proprietary APIs offer low setup friction and strong out-of-the-box performance; open-weight models trade that convenience for portability, customizability, and freedom from per-call pricing. Starting with a proprietary model to validate a use case, then evaluating Llama 3 for production scale, is a common and reasonable pattern.