Documentation

Review LLM Options

Information on supported LLM performance, cost, latency, and licensing, designed to help you short-list LLMs for testing

Konko AI provides fully managed access to the following top performing LLMs:

  1. Meta-Llama-2-13b-chat
  2. Meta-Llama-2-7b-chat
  3. Defog-sqlcoder2
  4. CodeLlama-34b-Instruct-hf
  5. CodeLlama-34b-Python-hf
  6. Bigcode-starcoder
  7. Mistral-7B-v0.1
  8. Togethercomputer-Llama-2-7B-32K-Instruct
  9. Tiiuae-falcon-40b

Additionally, it provides proxy access to prominent OpenAI models such as:

  1. GPT-4
  2. GPT-3.5

These models can be accessed through Konko AI's generate endpoint free of charge. This means you can test these models without setting up any inferencing infrastructure or paying per hour for GPUs.
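To give a feel for what testing looks like, here is a minimal sketch of calling a hosted model over HTTP using Python's requests library. The endpoint URL, payload fields, and header shown here are assumptions for illustration only; consult the Konko AI API reference for the exact request format.

```python
# Minimal sketch of calling a hosted model over HTTP.
# NOTE: the URL, payload fields, and auth header below are assumptions --
# check the Konko AI API reference for the exact request format.
import os
import requests

KONKO_API_URL = "https://api.konko.ai/v1/generate"  # assumed endpoint path
headers = {"Authorization": f"Bearer {os.environ['KONKO_API_KEY']}"}

payload = {
    "model": "meta-llama/Llama-2-13b-chat-hf",  # illustrative model identifier
    "prompt": "Summarize the benefits of managed LLM inference in two sentences.",
    "max_tokens": 128,
}

response = requests.post(KONKO_API_URL, headers=headers, json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```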

Once you are done testing, you can move to production by following the steps outlined here.

These models were curated based on overall performance at text-generation tasks and availability for commercial use.


Factors to Consider When Selecting an LLM


When integrating LLMs into your workflows, evaluate the following factors to ensure alignment with your use cases, technical requirements, and budget (a simple comparison harness is sketched after this list):

  1. Performance
    1. Detailed Understanding: Acknowledge the variance in LLM performance based on data, parameters, and training methods.
    2. Improvement Strategies: Leverage techniques like RAG and fine-tuning to enhance model effectiveness.
  2. Latency and Model Size
    1. Speed vs. Size Trade-Off: Strike a balance between the model's inference speed and its size to get the functionality you need.
    2. Capacity vs. Cost: Weigh the capability gains of larger models against your budget.
  3. Financial Viability
    1. Computational Expenditure: Recognize that larger models typically carry higher computational and financial demands.
    2. Proprietary Costs: Note that proprietary models, such as those from OpenAI, can carry substantial usage costs.
  4. Context Window Length
    1. Input Management: Ensure that the chosen LLM can effectively handle your necessary input size.
    2. Output Relevancy: Ensure outputs remain relevant and comprehensible in relation to the input.
  5. Linguistic Competency
    1. Language Proficiency: Ensure proficiency in the languages pertinent to your user base and data sets.
    2. Multilingual Capabilities: In instances of multiple languages, evaluate the model’s adaptability and accuracy across languages.
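As a starting point for weighing these factors, the sketch below sends the same prompts to a short-list of candidate models and records wall-clock latency and output length. The generate helper is a hypothetical placeholder for whatever client or HTTP call you use against the generate endpoint; the model names and prompts are illustrative only.

```python
# Rough comparison harness for short-listed models: send the same prompts to
# each candidate and record wall-clock latency and output length.
import statistics
import time

def generate(model: str, prompt: str) -> str:
    # Hypothetical placeholder -- replace with a real call to the generate endpoint.
    return f"[{model}] canned response to: {prompt}"

candidates = ["meta-llama/Llama-2-7b-chat-hf", "meta-llama/Llama-2-13b-chat-hf"]  # illustrative names
prompts = [
    "Explain vector databases to a new engineer.",
    "Draft a polite follow-up email to a customer.",
]

for model in candidates:
    latencies, output_chars = [], []
    for prompt in prompts:
        start = time.perf_counter()
        completion = generate(model, prompt)
        latencies.append(time.perf_counter() - start)
        output_chars.append(len(completion))
    print(
        f"{model}: median latency {statistics.median(latencies) * 1000:.1f} ms, "
        f"median output {statistics.median(output_chars):.0f} chars"
    )
```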

Supported LLMs - Basic Information


| LLM | Top use-cases | Open Source | Licensable for commercial use? | Context window (tokens) | Latency (vs. GPT-4) | Cost (vs. GPT-4) | Top languages | Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Meta-Llama-2-70b-chat |  | Yes | Yes (Llama 2 License) | 4,096 |  |  |  |  |
| Meta-Llama-2-13b-chat |  | Yes | Yes (Llama 2 License) | 4,096 |  |  |  |  |
| GPT-4 |  | No | Yes (OpenAI License) | 8,192 - 32,768 |  |  |  |  |
| GPT-3.5 |  | No | Yes (OpenAI License) | 4,096 - 16,384 |  |  |  |  |
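To make the context-window column concrete, the following sketch checks whether a prompt is likely to fit a model's window before sending it. The roughly-four-characters-per-token estimate is a crude assumption for illustration; exact counts depend on each model's tokenizer.

```python
# Rough pre-flight check that a prompt is likely to fit a model's context
# window, leaving room for the generated output. The ~4 characters per token
# estimate is a crude heuristic; exact counts depend on the model's tokenizer.
CONTEXT_WINDOWS = {
    "Meta-Llama-2-70b-chat": 4096,
    "Meta-Llama-2-13b-chat": 4096,
    "GPT-4": 8192,     # 32,768 for the extended-context variant
    "GPT-3.5": 4096,   # 16,384 for the extended-context variant
}

def fits_context(model: str, prompt: str, max_output_tokens: int = 256) -> bool:
    estimated_prompt_tokens = len(prompt) / 4  # ~4 chars per token heuristic
    return estimated_prompt_tokens + max_output_tokens <= CONTEXT_WINDOWS[model]

print(fits_context("Meta-Llama-2-13b-chat", "Summarize this document: ..."))
```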

Supported LLMs - Performance Benchmarking


The table below is meant to give Konko users a sense of the comparative performance of each model across a variety of tests. This is meant to inform your model evaluation and testing process.

Model performance is expressed as a percentile rank against other LLMs on the Open LLM Leaderboard across a variety of tasks.

| LLM | Average across tests | Reasoning (AI2 Challenge) | Common Sense (HellaSwag test) | Accuracy (MMLU test) | Factualness (TruthfulQA test) |
| --- | --- | --- | --- | --- | --- |
| Meta-Llama-2-70b-chat | 92nd percentile | 91st percentile | 93rd percentile | 92nd percentile | 46th percentile |
| Meta-Llama-2-13b-chat | 72nd percentile | 72nd percentile | 72nd percentile | 73rd percentile | 46th percentile |
| GPT-4 | 100th percentile | 100th percentile | 100th percentile | 100th percentile | 98th percentile |
| GPT-3.5 | 99th percentile | 100th percentile | 91st percentile | 97th percentile | 59th percentile |

How to read this table: "92nd percentile" means that the model outperforms 92% of the models on the Open LLM Leaderboard.
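For readers who want to reproduce the ranking logic, the snippet below shows how such a percentile rank can be computed; the leaderboard scores used are made up purely to illustrate the arithmetic.

```python
# Illustration of how a percentile rank is computed: the share of other
# leaderboard entries that the model outscores. The scores below are made up
# purely to show the arithmetic, not real leaderboard values.
def percentile_rank(model_score: float, leaderboard_scores: list[float]) -> float:
    outscored = sum(1 for s in leaderboard_scores if s < model_score)
    return 100 * outscored / len(leaderboard_scores)

leaderboard = [42.0, 55.3, 61.8, 64.2, 67.9, 70.1, 72.5]  # hypothetical average scores
print(f"{percentile_rank(67.9, leaderboard):.0f}th percentile")
```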

Source Note: The scores for GPT-3.5 and GPT-4 have been derived from their respective research papers.

Sources: Hugging Face, EleutherAI Language Model Evaluation Harness


The scores above are indicative. They reflect performance across four key tests:

  1. AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions.
  2. HellaSwag (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
  3. MMLU (5-shot) - a test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
  4. TruthfulQA (0-shot) - a test to measure a model’s propensity to reproduce falsehoods commonly found online.

We chose these measures of performance because they test the LLMs across a variety of tasks (reasoning, common sense, accuracy, and factualness).


Supported LLMs - Inference Speed and Cost


Below we show indicative inference speed and cost ranges for Konko AI supported models.

  • Speed ranges are expressed in milliseconds per token
  • Cost ranges are expressed as a percentage of the inference cost of GPT-3.5 Turbo

The speed and cost ranges reflect the inherent trade-off between hardware performance and cost: Konko AI supported LLMs can run on a range of GPUs, and larger, more powerful GPUs offer faster inference at a higher cost.
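The sketch below shows how a timed generation maps onto the units used in these charts; all figures in it are placeholders rather than published benchmarks or prices.

```python
# Converting a timed generation into the units used above: milliseconds per
# token and cost as a percentage of GPT-3.5 Turbo. All numbers are
# placeholders to show the arithmetic, not published benchmarks or prices.
def ms_per_token(elapsed_seconds: float, tokens_generated: int) -> float:
    return 1000 * elapsed_seconds / tokens_generated

def cost_vs_gpt35(cost_per_1k_tokens: float, gpt35_cost_per_1k_tokens: float) -> float:
    return 100 * cost_per_1k_tokens / gpt35_cost_per_1k_tokens

print(f"{ms_per_token(3.2, 128):.1f} ms/token")            # 25.0 ms/token
print(f"{cost_vs_gpt35(0.0004, 0.0015):.0f}% of GPT-3.5")   # 27% of GPT-3.5
```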


Konko AI supported model speed (relative to GPT-3.5 turbo speed)



Konko AI supported model inference cost (as a % of GPT-3.5 turbo cost)


Note: cost figures assume the 4K-context version of GPT-3.5 Turbo (the cheapest version of GPT-3.5 Turbo).


Requesting additional LLMs


We are constantly evaluating the LLM landscape and adding top-performing, commercially usable LLMs to our list of supported models as they are released.


If you would like to request that additional LLMs be included in the Konko API's generate endpoint, please email us at [email protected] and tell us why.