Name | Size Class | Avg cost / 10k tokens | Replica Req / Min | Avg Tokens / Request | Replica $ / min | Cost / Request |
---|---|---|---|---|---|---|
Coreweave | 6B | 5 | 24 | 135 | 0.012 | $0.00056 |
AI21 (token) | 20B | NaN | | 150 | | $0.0023 |
AI21 | 175B | 123.33333333333334 | 60 | 145 | 0.74 | $0.01275 |
Core (60%) | 6B | 3 | 24 | 135 | 0.0072 | $0.000333 |
AI21 (token) | 175B | NaN | | 150 | | $0.017 |
AI21 (token) | 6B | NaN | | 150 | | $0.00028 |
🛑 Forefront | 6B | 3.137322132693297 | 58.33 | 45 | 0.0183 | $0.001 |
🛑 Forefront | 20B | 52.37113402061856 | 9.7 | 100 | 0.0508 | $0.012 |
AI21 (wrong) | 20B | 10 | 120 | 150 | 0.12 | $0.001 |
AI21 (wrong) | 6B | 4.166666666666667 | 150 | 135 | 0.0625 | $0.00046 |
Legend
Avg Cost / 10k tokens - How many dollars we spend to generate 10k output tokens. Note this includes tokens generated in parallel, so if output tokens per call is 100 and n=3, that’s 300 output tokens generated. All output tokens are counted by encoding the model’s output text with the GPT-3 tokenizer.
Replica Req / Min - How many requests can a single replica serve in a minute? This is usually calculated by looking at the requests per minute in grafana and dividing by the number of replicas we have running.
Avg Tokens / Request - When calculating the above, what is the average number of output tokens requested by users? This can be seen in the same grafana dashboard.
Replica $ / min - How much does it cost us to rent a single replica for a single minute?
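As a rough sketch of how the legend's quantities relate for a dedicated replica, the snippet below derives a per-request cost and a per-10k-token cost from the replica price and throughput. It follows the legend's definitions literally; the function names and the example inputs are illustrative assumptions, and the table's own figures may have been produced with different inputs or rounding.

```python
def cost_per_request(replica_dollars_per_min: float, replica_req_per_min: float) -> float:
    """Dollars spent per request when one replica is kept fully busy."""
    return replica_dollars_per_min / replica_req_per_min


def cost_per_10k_tokens(
    replica_dollars_per_min: float,
    replica_req_per_min: float,
    avg_tokens_per_request: float,
) -> float:
    """Dollars spent to generate 10k output tokens on one replica."""
    per_request = cost_per_request(replica_dollars_per_min, replica_req_per_min)
    return per_request / avg_tokens_per_request * 10_000


if __name__ == "__main__":
    # Hypothetical inputs, not taken from the table:
    # a replica renting for $0.06/min that serves 30 req/min at 100 tokens each.
    print(cost_per_request(0.06, 30))
    print(cost_per_10k_tokens(0.06, 30, 100))
```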
AI21 Flexible Pricing
- Large
- $0.000054 / request
- $0.0054 / 1k generated tokens
- (150 generated tokens (50*3)/1000) * $0.0054 + $0.000054 = $0.000864 / request
- (10000/150) * $0.000864 = $0.0576 / 10k tokens
- New Large
- $0.00180 per 1K generated tokens
- $0.000018 per request
- (150 generated tokens (50*3)/1000) * $0.0018 + $0.000018 = $0.000288 / request
- Grande
- $0.00014 / request
- $0.0144 / 1k generated tokens
- (150 generated tokens (50*3)/1000) * $0.0144 + $0.00014 = $0.0023 / request
- (10000/150) * $0.0023 = $0.153 / 10k tokens (2.5x what we thought ... still 4x what Forefront was)
- (75 generated tokens / 1000) * $0.0144 + $0.00014 = $0.00122 / request (after adaptive response caching)
- Jumbo
- $0.002 / request
- $0.1 / 1k generated tokens
- (150 generated tokens (50*3)/1000) * $0.1 + $0.002 = $0.017 / request
- (10000/150) * $0.017 = $1.13 / 10k tokens
- What if we did n=2?
- $0.012
- Jumbo Usage?
- April: 76698*30 = 2,300,940
- May: 70703*31 = 2,191,793
- June: 68296*30 = 2,048,880
- Taking June, we would have paid $34,830 on token pricing. Doing n=2 we would have paid $24,586
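The per-tier arithmetic above can be sketched in a few lines. The prices are the ones listed for each tier; the `Tier` class, the function names, and the choice of 150 generated tokens (50 tokens x n=3) as the default workload are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    per_request: float    # flat fee per request, in dollars
    per_1k_tokens: float  # dollars per 1,000 generated tokens


# Prices as listed in the notes above.
TIERS = {
    "Large": Tier(per_request=0.000054, per_1k_tokens=0.0054),
    "New Large": Tier(per_request=0.000018, per_1k_tokens=0.0018),
    "Grande": Tier(per_request=0.00014, per_1k_tokens=0.0144),
    "Jumbo": Tier(per_request=0.002, per_1k_tokens=0.1),
}


def request_cost(tier: Tier, generated_tokens: float) -> float:
    """Token charge plus the flat per-request fee."""
    return generated_tokens / 1000 * tier.per_1k_tokens + tier.per_request


def cost_per_10k_tokens(tier: Tier, generated_tokens: float) -> float:
    """Cost of however many such requests it takes to generate 10k tokens."""
    return 10_000 / generated_tokens * request_cost(tier, generated_tokens)


if __name__ == "__main__":
    for name, tier in TIERS.items():
        print(f"{name}: ${request_cost(tier, 150):.6f} / request, "
              f"${cost_per_10k_tokens(tier, 150):.4f} / 10k tokens")

    # June Jumbo usage from the notes: 68296 requests/day over 30 days.
    june_requests = 68296 * 30
    print(f"June at n=3: ${june_requests * request_cost(TIERS['Jumbo'], 150):,.2f}")
    print(f"June at n=2: ${june_requests * request_cost(TIERS['Jumbo'], 100):,.2f}")
```

Changing `generated_tokens` is the quickest way to try other settings, such as n=2 (100 tokens) or the 75-token case after adaptive response caching.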
Note for switching to Jumbo at the per-request cost above, monthly totals by load: 20 RPM $14.7k, 30 RPM $22.1k, 40 RPM $29.3k, 50 RPM $36.7k
With Wyvern we’re seeing a 65% reduction in total tokens requested. That means that from a current cost of $280/day ($8,400/month) we would instead pay about $3k/month.
With Dragon we’re seeing a 55% reduction in total tokens requested and a 5% increase in requests. That means that from a current cost of $1,040/day ($32,500/month) we would instead pay about $550/day ($17,050/month).
Today we’re paying $50k/month. I think the bill after next will be $20k.
Once we change over Griffin to this new system, it will change a $40k bill, recently 20% off, into a $6k bill.
Given the cost, here’s the plan moving forward
- Sign contract with AI21. Use Grande for now
- Push forward on custom finetuning so that we can decrease Jumbo usage to the point that we turn off the dedicated instance and instead use per-token pricing
- Sign a 2-year dedicated instance deal with Coreweave
Update: Isaac and Ryan’s Griffin Recalculations (June 2nd 2022)
Cost for all instances = 60 instances x $0.93/hr x 24 hours x 30 days = $40,176
Total Griffin Calls = 1019 req/min * 60mins * 24hours * 30days = 44,020,800
Cost / Calls = $40,176 / 44,020,800 ≈ $0.00091
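The Griffin recalculation is a straight division of monthly instance spend by monthly call volume. A minimal sketch, assuming a 30-day month as in the lines above (the function names are illustrative):

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # the notes use a 30-day month


def dedicated_monthly_cost(instances: int, dollars_per_hour: float) -> float:
    """Monthly rent for a fleet of always-on instances."""
    return instances * dollars_per_hour * HOURS_PER_DAY * DAYS_PER_MONTH


def monthly_calls(req_per_min: float) -> float:
    """Total requests in a month at a steady request rate."""
    return req_per_min * 60 * HOURS_PER_DAY * DAYS_PER_MONTH


def cost_per_call(instances: int, dollars_per_hour: float, req_per_min: float) -> float:
    """Average dollars per call for the dedicated fleet."""
    return dedicated_monthly_cost(instances, dollars_per_hour) / monthly_calls(req_per_min)


if __name__ == "__main__":
    print(f"${dedicated_monthly_cost(60, 0.93):,.2f} / month")
    print(f"{monthly_calls(1019):,.0f} calls / month")
    print(f"${cost_per_call(60, 0.93, 1019):.6f} / call")
```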
Wyvern:
98.5 RPM, average output tokens at 182/request (Wyvern Hydra n=2, 6 responses)
$0.014 / 1000 tokens out, $0.00014 / request
$0.002688 / request
$11,437.98 Total
Dragon: 31.8 RPM, average output tokens at 161/request
$0.10 / 1000 tokens out, $0.002 / request
$0.0181 / request
$24,865.06 Total
At current usage we will turn $50k/month into $36,303.04 (save $13,696.96)
And we have more room to reduce Dragon usage as well as reduce token usage all around (probably getting another $15k of savings)
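The Wyvern and Dragon totals can be reproduced from the request rate, the average output tokens, and the two token-pricing rates quoted above. This is a sketch under the same 30-day-month assumption; the helper names are made up for illustration.

```python
MINUTES_PER_MONTH = 60 * 24 * 30  # 30-day month, as in the notes


def request_cost(avg_tokens: float, per_1k_tokens: float, per_request: float) -> float:
    """Token charge plus flat per-request fee, in dollars."""
    return avg_tokens / 1000 * per_1k_tokens + per_request


def monthly_cost(rpm: float, avg_tokens: float, per_1k_tokens: float, per_request: float) -> float:
    """Monthly bill at a steady request rate."""
    return rpm * MINUTES_PER_MONTH * request_cost(avg_tokens, per_1k_tokens, per_request)


# Inputs quoted in the notes.
wyvern = monthly_cost(rpm=98.5, avg_tokens=182, per_1k_tokens=0.014, per_request=0.00014)
dragon = monthly_cost(rpm=31.8, avg_tokens=161, per_1k_tokens=0.10, per_request=0.002)
total = wyvern + dragon
savings = 50_000 - total  # against the current $50k/month bill

if __name__ == "__main__":
    print(f"Wyvern ${wyvern:,.2f}  Dragon ${dragon:,.2f}")
    print(f"Total  ${total:,.2f}  Savings ${savings:,.2f}")
```

Because the bill is linear in both RPM and average tokens, lowering Dragon's n or trimming output length feeds straight through to the total.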
Things to test out
- Are 6 responses for Wyvern Hydra what we should be doing?
- What if we take Dragon to n=2? n=1?
- Will Story Style increase Wyvern usage?
- What about a popup advertising longer memory for Wyvern
Jumbo currently 38.5 RPM and 104 Token ($20,623.68)
New cost estimate of 45 RPM, 70 Token ($17,496)
Initial results are 41 RPM and 54 Tokens ($13,106.88)
Day 1 results: 53 RPM and 51 Tokens ($16,256.16), 2.86 vs 5.85 for empty outputs / 10 minutes
Wyvern Hydra using n=3, 2 heads. 111 RPM, 183 Tokens ($12,956.63)
Initial results going to n=2 with 2 heads are the same RPM and 117 Tokens ($8,525.86)
So the 10-minute changes we just made are saving us around $12k/month.
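To see where the monthly saving comes from, the before and after measurements can be compared side by side. The sketch below assumes Jumbo is billed at $0.10 / 1k tokens plus $0.002 / request and Wyvern Hydra at $0.014 / 1k tokens plus $0.00014 / request, the rates used earlier in these notes, and a 30-day month. It also assumes the saving is measured against Jumbo's initial results; using the Day 1 numbers instead gives a smaller figure.

```python
MINUTES_PER_MONTH = 60 * 24 * 30  # 30-day month, as in the notes

# (dollars per 1k output tokens, dollars per request), assumed from earlier in the notes
JUMBO_RATE = (0.10, 0.002)
WYVERN_RATE = (0.014, 0.00014)


def monthly_cost(rpm: float, avg_tokens: float, rate: tuple[float, float]) -> float:
    """Monthly bill at a steady request rate and average output length."""
    per_1k_tokens, per_request = rate
    return rpm * MINUTES_PER_MONTH * (avg_tokens / 1000 * per_1k_tokens + per_request)


jumbo_before = monthly_cost(38.5, 104, JUMBO_RATE)
jumbo_initial = monthly_cost(41, 54, JUMBO_RATE)
jumbo_day1 = monthly_cost(53, 51, JUMBO_RATE)
wyvern_before = monthly_cost(111, 183, WYVERN_RATE)
wyvern_after = monthly_cost(111, 117, WYVERN_RATE)

wyvern_saving = wyvern_before - wyvern_after
saving_initial = (jumbo_before - jumbo_initial) + wyvern_saving
saving_day1 = (jumbo_before - jumbo_day1) + wyvern_saving

if __name__ == "__main__":
    print(f"Saving vs. initial Jumbo results: ${saving_initial:,.2f} / month")
    print(f"Saving vs. Day 1 Jumbo results:   ${saving_day1:,.2f} / month")
```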