💰

Text Generation Cost Analysis

| Name | Size Class | Avg cost / 10k tokens | Replica Req / Min | Avg Tokens / Request | Replica $ / min | Cost / Request |
|---|---|---|---|---|---|---|
| Coreweave | 6B | 5 | 24 | 135 | 0.012 | $0.00056 |
| AI21 (token) | 20B | NaN | | 150 | | $0.0023 |
| AI21 | 175B | 123.33 | 60 | 145 | 0.74 | $0.01275 |
| Core (60%) | 6B | 3 | 24 | 135 | 0.0072 | $0.000333 |
| AI21 (token) | 175B | NaN | | 150 | | $0.017 |
| AI21 (token) | 6B | NaN | | 150 | | $0.00028 |
| 🛑 Forefront | 6B | 3.14 | 58.33 | 45 | 0.0183 | $0.001 |
| 🛑 Forefront | 20B | 52.37 | 9.7 | 100 | 0.0508 | $0.012 |
| AI21 (wrong) | 20B | 10 | 120 | 150 | 0.12 | $0.001 |
| AI21 (wrong) | 6B | 4.17 | 150 | 135 | 0.0625 | $0.00046 |

Legend

Avg Cost / 10k tokens - How many dollars we spend to generate 10k output tokens. This includes tokens generated in parallel: if output tokens per call is 100 and n=3, that’s 300 output tokens generated. All output tokens are counted by encoding the model’s output text with the GPT-3 tokenizer.

Replica Req / Min - How many requests can a single replica serve in a minute? This is usually calculated by looking at the requests per minute in Grafana and dividing by the number of replicas we have running.

Avg Tokens / Request - When calculating the above, what is the average number of output tokens requested by users? This can be seen in the same Grafana dashboard.

Replica $ / min - How much does it cost us to rent a single replica for a single minute?
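The replica-based columns are related by simple division. The sketch below is only an illustration of the legend, not the formulas actually used to build the table; the function names are hypothetical. For the replica rows, the numbers listed under "Avg cost / 10k tokens" are reproduced by replica $ / min ÷ req / min × 10,000 (dollars per 10,000 requests), while dividing by tokens per request as well gives the per-10k-token figure the legend describes. The table's Cost / Request column seems to include adjustments not described here, so the raw quotient will not match it exactly.

```python
def cost_per_request(replica_usd_per_min: float, replica_req_per_min: float) -> float:
    """Raw dollars per request for one replica (no utilization adjustment)."""
    return replica_usd_per_min / replica_req_per_min


def cost_per_10k_tokens(replica_usd_per_min: float,
                        replica_req_per_min: float,
                        avg_tokens_per_request: float) -> float:
    """Dollars per 10,000 output tokens, following the legend's definition."""
    per_request = cost_per_request(replica_usd_per_min, replica_req_per_min)
    return per_request / avg_tokens_per_request * 10_000


# Coreweave 6B row: $0.012 / min, 24 req / min, 135 tokens / request
per_request = cost_per_request(0.012, 24)
per_10k_requests = per_request * 10_000
per_10k_tokens = cost_per_10k_tokens(0.012, 24, 135)

print(f"per request:      ${per_request:.5f}")
print(f"per 10k requests: ${per_10k_requests:.2f}")
print(f"per 10k tokens:   ${per_10k_tokens:.4f}")
```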

AI21 Flexible Pricing

  • Large
    • $0.000054 / request
    • $0.0054 / 1k generated tokens
    • (150 generated tokens (50*3)/1000) * $0.0054 + $0.000054 = $0.000864 / request
    • (10000/150) * $0.000864 = $0.0576 / 10k tokens
  • New Large
    • $0.00180 per 1K generated tokens
    • $0.000018 per request
    • (150 generated tokens (50*3)/1000) * $0.0018 + $0.000018 = $0.000288 / request
  • Grande
    • $0.00014 / request
    • $0.0144 / 1k generated tokens
    • (150 generated tokens (50*3)/1000) * $0.0144 + $0.00014 = $0.0023 / request
    • (10000/150) * $0.0023 = $0.153 / 10k tokens (2.5x what we thought ... still 4x what Forefront was)
    • (75 generated tokens / 1000) * $0.0144 + $0.00014 = $0.00122 / request (after adaptive response caching)
  • Jumbo
    • $0.002 / request
    • $0.1 / 1k generated tokens
    • (150 generated tokens (50*3)/1000) * $0.1 + $0.002 = $0.017 / request
    • (10000/150) * $0.017 = $1.13 / 10k tokens
    • What if we did n=2?
      • $0.012
    • Jumbo Usage?
      • April: 76698*30 = 2,300,940
      • May: 70703*31 = 2,191,793
      • June: 68296*30 = 2,048,880
      • Taking June, we would have paid $34,830 on token pricing. Doing n=2 we would have paid $24,586
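The per-request arithmetic in the list above follows one pattern: generated tokens / 1000 × token price + per-request fee. A minimal sketch, using only the prices listed above; the dictionary and function names are made up for illustration:

```python
# tier -> (per-request fee, price per 1k generated tokens), in USD, as listed above
PRICING = {
    "Large": (0.000054, 0.0054),
    "New Large": (0.000018, 0.0018),
    "Grande": (0.00014, 0.0144),
    "Jumbo": (0.002, 0.1),
}


def request_cost(tier: str, generated_tokens: float) -> float:
    """Cost of one request that generates `generated_tokens` output tokens."""
    fee, usd_per_1k = PRICING[tier]
    return generated_tokens / 1000 * usd_per_1k + fee


def cost_per_10k_tokens(tier: str, generated_tokens: float) -> float:
    """Cost of 10k output tokens when every request generates `generated_tokens`."""
    return 10_000 / generated_tokens * request_cost(tier, generated_tokens)


# 150 generated tokens per request = 50 tokens x n=3
for tier in PRICING:
    print(f"{tier}: ${request_cost(tier, 150):.6f} / request, "
          f"${cost_per_10k_tokens(tier, 150):.4f} / 10k tokens")

# June Jumbo usage: 68,296 requests/day over 30 days
june_requests = 68296 * 30
print(f"June, n=3: ${june_requests * request_cost('Jumbo', 150):,.0f}")
print(f"June, n=2: ${june_requests * request_cost('Jumbo', 100):,.0f}")
```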

Note for switching to Jumbo per request cost: 20 RPM $14.7k, 30 RPM $22.1k, 40 RPM $29.3k, 50 RPM $36.7k
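The RPM figures in the note can be approximated as requests per month × cost per request. This sketch assumes a 30-day month and the $0.017 / request Jumbo figure (150 generated tokens); both are assumptions, and the names are hypothetical. Under them the results land within roughly $100 of the numbers in the note.

```python
MINUTES_PER_MONTH = 60 * 24 * 30  # assumption: 30-day month
JUMBO_COST_PER_REQUEST = 0.017    # assumption: 150 generated tokens at Jumbo token pricing


def monthly_cost(rpm: float, cost_per_request: float) -> float:
    """Monthly spend at a steady request rate (requests per minute)."""
    return rpm * MINUTES_PER_MONTH * cost_per_request


for rpm in (20, 30, 40, 50):
    print(f"{rpm} RPM: ${monthly_cost(rpm, JUMBO_COST_PER_REQUEST):,.0f} / month")
```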

@July 13, 2022 Results of Adaptive Retry

With Wyvern we’re seeing a 65% reduction in total tokens requested. At the current cost of $280/day ($8,400/month), that would bring us down to about $3k/month.

With Dragon we’re seeing a 55% reduction in total tokens requested and a 5% increase in requests. At the current cost of $1,040/day ($32,500/month), that would bring us down to about $550/day ($17,050/month).

Today we’re paying $50k/month. I think the bill after next will be $20k.

Once we change over Griffin to this new system, it will change a $40k bill, recently 20% off, into a $6k bill.

Given the cost, here’s the plan moving forward

  1. Sign contract with AI21. Use Grande for now
  2. Push forward on custom finetuning so that we can decrease Jumbo usage to the point that we turn off the dedicated instance and instead use per-token pricing
  3. Sign a 2-year dedicated instance deal with Coreweave

Update: Isaac and Ryan’s Griffin Recalculations (June 2nd 2022)

Cost for all instances = 60 instances x $0.93/hr x 24 hours x 30 days = $40,176

Total Griffin calls = 1019 req/min x 60 mins x 24 hours x 30 days = 44,020,800

Cost / call = $40,176 / 44,020,800 = roughly $0.0009
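The recalculation above can be reproduced in a few lines. This is a sketch using only the numbers given (60 instances, 0.93/hr, 1019 req/min, 30-day month); the variable names are illustrative. Printing with enough decimal places avoids the per-call cost rounding down to zero.

```python
INSTANCES = 60
USD_PER_INSTANCE_HOUR = 0.93
HOURS_PER_MONTH = 24 * 30  # 30-day month, as in the calculation above
GRIFFIN_REQ_PER_MIN = 1019

monthly_instance_cost = INSTANCES * USD_PER_INSTANCE_HOUR * HOURS_PER_MONTH
monthly_calls = GRIFFIN_REQ_PER_MIN * 60 * HOURS_PER_MONTH
cost_per_call = monthly_instance_cost / monthly_calls

print(f"instances: ${monthly_instance_cost:,.0f} / month")
print(f"calls:     {monthly_calls:,} / month")
print(f"per call:  ${cost_per_call:.6f}")
```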

@September 16, 2022 New cost without dedicated instance

Wyvern:

98.5 RPM, average output tokens at 182/request (Wyvern Hydra n=2, 6 responses)

$0.014 / 1000 tokens out, $0.00014 / request

$0.002688 / request

$11,437.98 Total

Dragon: 31.8 RPM, average output tokens at 161/request

$0.10 / 1000 tokens out, $0.002 / request

$0.0181 / request

$24,865.06 Total

At current usage we will turn $50k/month into $36,303.04 (save $13,696.96)
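The monthly totals above follow from RPM × minutes per month × cost per request. A minimal sketch, assuming a 30-day month (which is what makes the totals above come out) and using hypothetical function names:

```python
MINUTES_PER_MONTH = 60 * 24 * 30  # assumption: 30-day month


def request_cost(tokens_out: float, usd_per_1k_tokens: float, usd_per_request: float) -> float:
    """Token-priced cost of a single request."""
    return tokens_out / 1000 * usd_per_1k_tokens + usd_per_request


def monthly_cost(rpm: float, tokens_out: float,
                 usd_per_1k_tokens: float, usd_per_request: float) -> float:
    """Monthly spend at a steady request rate."""
    per_request = request_cost(tokens_out, usd_per_1k_tokens, usd_per_request)
    return rpm * MINUTES_PER_MONTH * per_request


wyvern = monthly_cost(98.5, 182, 0.014, 0.00014)
dragon = monthly_cost(31.8, 161, 0.10, 0.002)
total = wyvern + dragon
savings = 50_000 - total  # versus the current $50k/month bill

print(f"Wyvern:  ${wyvern:,.0f}")
print(f"Dragon:  ${dragon:,.0f}")
print(f"Total:   ${total:,.0f}")
print(f"Savings: ${savings:,.0f}")
```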

And we have more room to reduce Dragon usage as well as reduce token usage all around (probably getting another $15k of savings)

Things to test out

  • Are 6 responses for Wyvern Hydra what we should be doing?
  • What if we take Dragon to n=2? n=1?
  • Will Story Style increase Wyvern usage?
  • What about a popup advertising longer memory for Wyvern?

@October 10, 2022 More cost changes

Jumbo is currently at 38.5 RPM and 104 tokens/request ($20,623.68)

New cost estimate of 45 RPM, 70 tokens ($17,496)

Initial results are 41 RPM and 54 tokens ($13,106.88)

Day 1 results: 53 RPM and 51 tokens ($16,256.16), 2.86 vs 5.85 for empty outputs / 10 minutes

Wyvern Hydra using n=3, 2 heads: 111 RPM, 183 tokens ($12,956.63)

Initial results going to n=2 with 2 heads are the same RPM and 117 tokens ($8,525.86)

So the 10-minute changes we just made are saving us around $12k/month.
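The dollar figures in this section can be checked with the same per-request formula. The sketch assumes the September token prices still apply (Jumbo at $0.10 / 1k tokens + $0.002 / request, Wyvern at $0.014 / 1k tokens + $0.00014 / request) and a 30-day month; the function name is hypothetical. Under those assumptions, the roughly $12k/month saving corresponds to the initial Jumbo results; using the Day 1 Jumbo numbers instead gives closer to $8.8k/month.

```python
MINUTES_PER_MONTH = 60 * 24 * 30  # assumption: 30-day month


def monthly_cost(rpm: float, tokens_out: float,
                 usd_per_1k_tokens: float, usd_per_request: float) -> float:
    """Monthly spend at a steady request rate with token-based pricing."""
    per_request = tokens_out / 1000 * usd_per_1k_tokens + usd_per_request
    return rpm * MINUTES_PER_MONTH * per_request


JUMBO = (0.10, 0.002)      # assumed: $ per 1k tokens, $ per request
WYVERN = (0.014, 0.00014)  # assumed: $ per 1k tokens, $ per request

jumbo_before = monthly_cost(38.5, 104, *JUMBO)
jumbo_initial = monthly_cost(41, 54, *JUMBO)
jumbo_day1 = monthly_cost(53, 51, *JUMBO)
wyvern_before = monthly_cost(111, 183, *WYVERN)
wyvern_after = monthly_cost(111, 117, *WYVERN)

wyvern_savings = wyvern_before - wyvern_after
savings_initial = (jumbo_before - jumbo_initial) + wyvern_savings
savings_day1 = (jumbo_before - jumbo_day1) + wyvern_savings

print(f"Savings with initial Jumbo results: ${savings_initial:,.0f} / month")
print(f"Savings with Day 1 Jumbo results:   ${savings_day1:,.0f} / month")
```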