Using GPT-4 and Claude to Extract Structured Data From Any Webpage in 2026
Traditional web scraping breaks when sites change their HTML structure. LLM-based extraction doesn't — you describe what you want in plain English, and the model finds it regardless of how the page is structured.
Here's when this approach beats traditional scraping, and the complete implementation.
## The Core Idea
Traditional scraping:

```python
price = soup.find('span', class_='product-price').text  # Breaks if class changes
```

LLM extraction:

```python
price = llm_extract("What is the product price on this page?", page_html)
# Works even if the structure changes completely
```
The trade-off: LLM extraction costs money and is slower. Traditional scraping is free and fast. Use LLMs when:

- Structure changes frequently (news sites, e-commerce with A/B testing)
- You're scraping many different sites and can't maintain per-site parsers
- You need semantic understanding (sentiment, summaries, classifications)
- The data is in tables, PDFs, images, or other unstructured formats
## Method 1: Direct HTML → Structured JSON with GPT-4o-mini
````python
from openai import OpenAI
from bs4 import BeautifulSoup
import requests, json, re

client = OpenAI()  # Uses OPENAI_API_KEY env var

def extract_with_gpt(url: str, schema: dict) -> dict:
    """
    Extract structured data from a webpage using GPT-4o-mini.

    schema: dict describing what to extract
    Example: {"product_name": "str", "price": "float",
              "rating": "float", "reviews_count": "int"}
    """
    # Get the page HTML
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"})

    # Clean HTML to reduce token count (remove scripts, styles, comments)
    soup = BeautifulSoup(r.text, 'html.parser')
    for tag in soup.find_all(['script', 'style', 'noscript', 'meta', 'link']):
        tag.decompose()

    # Get clean text (much cheaper than full HTML)
    clean_text = soup.get_text(separator='\n', strip=True)

    # Truncate to fit context window (keep relevant parts)
    max_chars = 12000  # ~3000 tokens
    if len(clean_text) > max_chars:
        clean_text = clean_text[:max_chars] + "\n...[truncated]"

    schema_str = json.dumps(schema, indent=2)

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheapest option, ~$0.0002 per page
        messages=[
            {
                "role": "system",
                "content": f"""Extract data from the webpage text and return ONLY a JSON object matching this schema:
{schema_str}

Rules:
- Return ONLY valid JSON, no other text
- Use null for missing fields
- Convert prices to numbers (remove currency symbols)
- If a field isn't found, return null"""
            },
            {"role": "user", "content": f"Extract data from this webpage:\n\n{clean_text}"}
        ],
        temperature=0  # Deterministic output
    )

    result_text = response.choices[0].message.content
    try:
        return json.loads(result_text)
    except json.JSONDecodeError:
        # Sometimes models wrap the JSON in markdown code blocks
        match = re.search(r'```(?:json)?\s*(.*?)\s*```', result_text, re.DOTALL)
        if match:
            return json.loads(match.group(1))
        raise

# Usage
product_data = extract_with_gpt(
    "https://www.amazon.com/dp/B09G9FPHY6",
    schema={
        "product_name": "str",
        "price": "float",
        "rating": "float",
        "reviews_count": "int",
        "availability": "str",
        "brand": "str"
    }
)
print(json.dumps(product_data, indent=2))
# {
#   "product_name": "Echo Dot (5th Gen)",
#   "price": 49.99,
#   "rating": 4.7,
#   "reviews_count": 123456,
#   "availability": "In Stock",
#   "brand": "Amazon"
# }
````
Cost estimate: GPT-4o-mini at ~3000 tokens/page = ~$0.0002/page = $0.20 per 1000 pages.
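Per-page cost scales with token volume and model price, so it is worth re-checking the arithmetic whenever either changes. Here is a back-of-envelope estimator; the default per-million-token rates are illustrative assumptions, not quoted pricing, so substitute your provider's current numbers.

```python
# Back-of-envelope cost estimator for LLM extraction runs.
# NOTE: the default per-1M-token prices are placeholder assumptions;
# check your provider's pricing page for current rates.

def estimate_cost(pages: int,
                  input_tokens_per_page: int = 3000,
                  output_tokens_per_page: int = 200,
                  usd_per_1m_input: float = 0.15,
                  usd_per_1m_output: float = 0.60) -> float:
    """Total estimated USD cost for extracting `pages` pages."""
    input_cost = pages * input_tokens_per_page * usd_per_1m_input / 1_000_000
    output_cost = pages * output_tokens_per_page * usd_per_1m_output / 1_000_000
    return round(input_cost + output_cost, 4)

print(estimate_cost(1000))  # 0.57 at these assumed rates
```

Output tokens cost several times more than input tokens on most APIs, which is another reason to keep the extraction schema small.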
## Method 2: Structured Output with Pydantic Validation
Force the LLM to return valid structured data using Pydantic:
```python
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Optional, List
import requests
from bs4 import BeautifulSoup

client = OpenAI()

class ProductReview(BaseModel):
    rating: float = Field(ge=1, le=5)
    text: str
    author: str
    date: Optional[str]
    verified_purchase: bool = False

class ProductData(BaseModel):
    name: str
    price: Optional[float]
    currency: str = "USD"
    rating: Optional[float] = Field(None, ge=0, le=5)
    review_count: Optional[int]
    availability: Optional[str]
    features: List[str] = []
    top_reviews: List[ProductReview] = []

def extract_product_structured(url: str) -> ProductData:
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(r.text, 'html.parser')
    for tag in soup.find_all(['script', 'style']):
        tag.decompose()
    text = soup.get_text(separator='\n', strip=True)[:12000]

    # Use OpenAI's structured output feature
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract product information from this webpage. Include top 3 reviews if available."},
            {"role": "user", "content": text}
        ],
        response_format=ProductData
    )
    return response.choices[0].message.parsed

product = extract_product_structured("https://www.amazon.com/dp/...")
print(f"Product: {product.name}")
print(f"Price: ${product.price}")
print(f"Rating: {product.rating}/5 ({product.review_count} reviews)")
```
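If you can't take the Pydantic dependency, a minimal stdlib validator for the `{"field": "type-name"}` schema convention used throughout this article might look like the sketch below; `validate_extracted` is a hypothetical helper, not part of any library.

```python
# Minimal stdlib validator for the {"field": "type-name"} schema
# convention used in this article. A sketch -- Pydantic is the more
# robust option when you can take the dependency.
_CASTS = {"str": str, "float": float, "int": int, "bool": bool}

def validate_extracted(data: dict, schema: dict) -> dict:
    """Coerce extracted values to the declared types; missing fields become None."""
    out = {}
    for field, type_name in schema.items():
        value = data.get(field)
        if value is None:
            out[field] = None
        elif type_name == "bool" and isinstance(value, str):
            # Models sometimes return availability strings instead of booleans
            out[field] = value.strip().lower() in ("true", "yes", "in stock", "1")
        else:
            out[field] = _CASTS[type_name](value)
    return out

clean = validate_extracted(
    {"price": "49.99", "reviews_count": "123456"},
    {"price": "float", "reviews_count": "int", "brand": "str"},
)
print(clean)  # {'price': 49.99, 'reviews_count': 123456, 'brand': None}
```

This catches the most common failure mode (numbers returned as strings) without adding any dependencies.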
## Method 3: Claude for Long Documents
Claude handles larger contexts better — useful for long articles, reports, or multi-page PDFs:
```python
import anthropic
import requests
from bs4 import BeautifulSoup
import json

client = anthropic.Anthropic()  # Uses ANTHROPIC_API_KEY env var

def extract_with_claude(url: str, extraction_prompt: str) -> str:
    """
    Extract information from a webpage using Claude.
    Better for: long pages, nuanced extraction, natural language output.
    """
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(r.text, 'html.parser')
    for tag in soup.find_all(['script', 'style', 'noscript']):
        tag.decompose()

    # Claude can handle up to 200K tokens — much more headroom
    text = soup.get_text(separator='\n', strip=True)[:50000]

    message = client.messages.create(
        model="claude-3-5-haiku-latest",  # Cheapest Claude model
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"{extraction_prompt}\n\nWebpage content:\n{text}"
        }]
    )
    return message.content[0].text

# Example: Extract pricing table from a SaaS pricing page
pricing = extract_with_claude(
    "https://example-saas.com/pricing",
    """Extract the pricing information and return it as a JSON array.
Each plan should have: name, monthly_price, annual_price, features (list).
Return ONLY the JSON, no other text."""
)
print(json.loads(pricing))
```
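Note that `json.loads(pricing)` will raise if the model wraps its answer in a markdown fence or adds prose despite the instructions. A defensive parsing helper is cheap insurance; `parse_model_json` below is a hypothetical sketch, not part of the Anthropic or OpenAI SDKs.

````python
import json
import re

def parse_model_json(raw: str):
    """Parse JSON out of an LLM reply, tolerating markdown fences
    and surrounding prose. Hypothetical helper, not part of any SDK."""
    # Happy path: the reply is already pure JSON
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Strip a ```json ... ``` fence if present
    fenced = re.search(r'```(?:json)?\s*(.*?)\s*```', raw, re.DOTALL)
    if fenced:
        return json.loads(fenced.group(1))
    # Last resort: grab the first {...} or [...] span
    span = re.search(r'(\{.*\}|\[.*\])', raw, re.DOTALL)
    if span:
        return json.loads(span.group(1))
    raise ValueError("no JSON found in model output")

print(parse_model_json('Here you go:\n```json\n[{"name": "Pro", "monthly_price": 29}]\n```'))
# [{'name': 'Pro', 'monthly_price': 29}]
````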
## Method 4: Async Batch Extraction (Production Scale)
For scraping 100+ pages cost-effectively:
```python
import asyncio
import aiohttp
from openai import AsyncOpenAI
from bs4 import BeautifulSoup
import json
from typing import List, Dict

client = AsyncOpenAI()

async def fetch_page(session: aiohttp.ClientSession, url: str) -> str:
    """Fetch a page asynchronously and return its cleaned text."""
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"}
    async with session.get(url, headers=headers) as response:
        html = await response.text()
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all(['script', 'style']):
        tag.decompose()
    return soup.get_text(separator='\n', strip=True)[:8000]

async def extract_from_page(text: str, schema: dict) -> dict:
    """Extract structured data using GPT-4o-mini."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Return ONLY a JSON object matching: {json.dumps(schema)}"},
            {"role": "user", "content": text}
        ],
        temperature=0
    )
    try:
        return json.loads(response.choices[0].message.content)
    except json.JSONDecodeError:
        return {}

async def batch_extract(urls: List[str], schema: dict, concurrency: int = 5) -> List[Dict]:
    """Extract data from many URLs concurrently."""
    semaphore = asyncio.Semaphore(concurrency)

    async def process_url(session, url):
        async with semaphore:
            text = await fetch_page(session, url)
            result = await extract_from_page(text, schema)
            result['_url'] = url
            return result

    async with aiohttp.ClientSession() as session:
        tasks = [process_url(session, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    return [r for r in results if isinstance(r, dict)]

# Usage: extract from 100 product pages
urls = ["https://shop.example.com/product/1", "https://shop.example.com/product/2", ...]
schema = {"name": "str", "price": "float", "in_stock": "bool"}
results = asyncio.run(batch_extract(urls, schema, concurrency=5))
print(f"Extracted {len(results)} products")
# Cost for 100 pages at ~$0.0002/page = $0.02
```
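At batch scale you will also hit rate limits and transient network errors, so it is worth wrapping each call in a retry. One hedged pattern is exponential backoff with jitter; this generic sketch retries any coroutine factory and deliberately isn't tied to a specific SDK exception type.

```python
import asyncio
import random

async def with_retries(coro_factory, max_attempts: int = 4, base_delay: float = 1.0):
    """Retry an async call with exponential backoff plus jitter.
    `coro_factory` is a zero-argument callable returning a fresh coroutine."""
    for attempt in range(max_attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts -- surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)

# Self-check with a coroutine that fails twice, then succeeds
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(asyncio.run(with_retries(flaky, base_delay=0.01)))  # ok
```

In production you would narrow the `except` clause to the SDK's rate-limit and timeout exceptions rather than catching everything.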
## Hybrid Approach: Fast + Accurate
Use CSS selectors first, fall back to LLM only when they fail:
```python
import requests
from bs4 import BeautifulSoup
from openai import OpenAI
import json

client = OpenAI()

SELECTORS = {
    'amazon': {
        'name': '#productTitle',
        'price': '.a-price-whole',
        'rating': '.a-icon-alt',
    },
    'generic': None  # Fall back to LLM
}

def extract_price_hybrid(url: str) -> dict:
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(r.text, 'html.parser')

    # Try fast CSS selectors first
    if 'amazon.com' in url:
        selectors = SELECTORS['amazon']
        result = {}
        all_found = True
        for field, selector in selectors.items():
            elem = soup.select_one(selector)
            if elem:
                result[field] = elem.text.strip()
            else:
                all_found = False
                break
        if all_found:
            return {'method': 'css', 'data': result}

    # Fall back to LLM
    for tag in soup.find_all(['script', 'style']):
        tag.decompose()
    text = soup.get_text(separator='\n', strip=True)[:10000]

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": 'Return ONLY JSON: {"name": "str", "price": "float", "rating": "float"}'},
            {"role": "user", "content": text}
        ],
        temperature=0
    )
    try:
        return {'method': 'llm', 'data': json.loads(response.choices[0].message.content)}
    except json.JSONDecodeError:
        return {'method': 'failed', 'data': {}}

result = extract_price_hybrid("https://amazon.com/dp/...")
print(f"Method: {result['method']}, Data: {result['data']}")
```
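One more production lever: at `temperature=0` extraction results are stable enough to be worth caching, so you never pay for the same URL twice. Here is a sketch of a disk cache keyed by URL hash; `cached_extract` and the stub extractor are illustrative names, not from any library.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cached_extract(url: str, extractor, cache_dir: Path) -> dict:
    """Run `extractor(url)` once per URL, memoising the result on disk.
    `extractor` is any callable like extract_price_hybrid."""
    key = hashlib.sha256(url.encode()).hexdigest()[:16]
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())  # cache hit: no LLM call
    result = extractor(url)
    path.write_text(json.dumps(result))
    return result

# Demo with a stub extractor so this runs offline
cache = Path(tempfile.mkdtemp())
hits = {"n": 0}

def stub_extractor(url):
    hits["n"] += 1
    return {"method": "css", "data": {"price": "49.99"}}

first = cached_extract("https://example.com/p/1", stub_extractor, cache)
second = cached_extract("https://example.com/p/1", stub_extractor, cache)
print(hits["n"])  # 1 -- the second call was served from the cache
```

For pages that change over time (prices, stock), add a timestamp to the cached record and expire entries after a TTL that matches how fresh your data needs to be.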
## Cost Comparison by Volume
| Volume | CSS-only | GPT-4o-mini | GPT-4o | Claude Haiku |
|---|---|---|---|---|
| 100 pages | $0 | $0.02 | $0.50 | $0.03 |
| 1,000 pages | $0 | $0.20 | $5.00 | $0.30 |
| 10,000 pages | $0 | $2.00 | $50.00 | $3.00 |
| 100,000 pages | $0 | $20.00 | $500.00 | $30.00 |
Practical rule: Use CSS selectors for sites you control or know well. Use LLMs for:
- Unknown site structures (one-off extractions)
- Semantic extraction (summaries, classifications)
- Fallback when selectors break
## Related Articles
- Python Web Scraping Tutorial for Beginners 2026 — Traditional scraping foundation
- Web Scraping Tools Comparison 2026 — When to use which tool
- How to Validate LLM Outputs in Production — Validating extracted data
Want ready-to-use web scraping tools without the setup hassle?
The $29 Apify Scrapers Bundle includes 35+ production-ready scrapers — Google SERP, LinkedIn, Amazon, TikTok Shop, contact info, and more. One-time payment, instant download.
👉 Get the Bundle ($29)
## Related Tools
- ai-news-summarizer
- b2b-review-intelligence