Using GPT-4 and Claude to Extract Structured Data From Any Webpage in 2026
Traditional web scraping breaks when sites change their HTML structure. LLM-based extraction doesn't — you describe what you want in plain English, and the model finds it regardless of how the page is structured.
Here's when this approach beats traditional scraping, and the complete implementation.
## The Core Idea
Traditional scraping:

```python
price = soup.find('span', class_='product-price').text  # Breaks if class changes
```

LLM extraction:

```python
price = llm_extract("What is the product price on this page?", page_html)
# Works even if the structure changes completely
```
The trade-off: LLM extraction costs money and is slower. Traditional scraping is free and fast. Use LLMs when:

- Structure changes frequently (news sites, e-commerce with A/B testing)
- You're scraping many different sites and can't maintain per-site parsers
- You need semantic understanding (sentiment, summaries, classifications)
- The data is in tables, PDFs, images, or other unstructured formats
## Method 1: Direct HTML → Structured JSON with GPT-4o-mini
````python
from openai import OpenAI
from bs4 import BeautifulSoup
import requests, json, re

client = OpenAI()  # Uses OPENAI_API_KEY env var

def extract_with_gpt(url: str, schema: dict) -> dict:
    """
    Extract structured data from a webpage using GPT-4o-mini.

    schema: dict describing what to extract
    Example: {"product_name": "str", "price": "float",
              "rating": "float", "reviews_count": "int"}
    """
    # Get the page HTML
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"})

    # Clean HTML to reduce token count (remove scripts, styles, comments)
    soup = BeautifulSoup(r.text, 'html.parser')
    for tag in soup.find_all(['script', 'style', 'noscript', 'meta', 'link']):
        tag.decompose()

    # Get clean text (much cheaper than full HTML)
    clean_text = soup.get_text(separator='\n', strip=True)

    # Truncate to fit context window (keep relevant parts)
    max_chars = 12000  # ~3000 tokens
    if len(clean_text) > max_chars:
        clean_text = clean_text[:max_chars] + "\n...[truncated]"

    schema_str = json.dumps(schema, indent=2)

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheapest option, ~$0.0002 per page
        messages=[
            {
                "role": "system",
                "content": f"""Extract data from the webpage text and return ONLY a JSON object matching this schema:
{schema_str}

Rules:
- Return ONLY valid JSON, no other text
- Use null for missing fields
- Convert prices to numbers (remove currency symbols)
- If a field isn't found, return null"""
            },
            {"role": "user", "content": f"Extract data from this webpage:\n\n{clean_text}"}
        ],
        temperature=0  # Deterministic output
    )

    result_text = response.choices[0].message.content
    try:
        return json.loads(result_text)
    except json.JSONDecodeError:
        # Sometimes models wrap the JSON in markdown code blocks
        match = re.search(r'```(?:json)?\s*(.*?)\s*```', result_text, re.DOTALL)
        if match:
            return json.loads(match.group(1))
        raise

# Usage
product_data = extract_with_gpt(
    "https://www.amazon.com/dp/B09G9FPHY6",
    schema={
        "product_name": "str",
        "price": "float",
        "rating": "float",
        "reviews_count": "int",
        "availability": "str",
        "brand": "str"
    }
)
print(json.dumps(product_data, indent=2))
# {
#   "product_name": "Echo Dot (5th Gen)",
#   "price": 49.99,
#   "rating": 4.7,
#   "reviews_count": 123456,
#   "availability": "In Stock",
#   "brand": "Amazon"
# }
````
Cost estimate: GPT-4o-mini at ~3000 tokens/page = ~$0.0002/page = $0.20 per 1000 pages.
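Per-page cost scales with token volume and model price, so it is worth re-checking the arithmetic whenever either changes. Here is a back-of-envelope estimator; the default per-million-token rates are illustrative assumptions, not quoted pricing, so substitute your provider's current numbers.

```python
# Back-of-envelope cost estimator for LLM extraction runs.
# NOTE: the default per-1M-token prices are placeholder assumptions;
# check your provider's pricing page for current rates.

def estimate_cost(pages: int,
                  input_tokens_per_page: int = 3000,
                  output_tokens_per_page: int = 200,
                  usd_per_1m_input: float = 0.15,
                  usd_per_1m_output: float = 0.60) -> float:
    """Total estimated USD cost for extracting `pages` pages."""
    input_cost = pages * input_tokens_per_page * usd_per_1m_input / 1_000_000
    output_cost = pages * output_tokens_per_page * usd_per_1m_output / 1_000_000
    return round(input_cost + output_cost, 4)

print(estimate_cost(1000))  # 0.57 at these assumed rates
```

Output tokens cost several times more than input tokens on most APIs, which is another reason to keep the extraction schema small.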
## Method 2: Structured Output with Pydantic Validation
Force the LLM to return valid structured data using Pydantic:
```python
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Optional, List
import requests
from bs4 import BeautifulSoup

client = OpenAI()

class ProductReview(BaseModel):
    rating: float = Field(ge=1, le=5)
    text: str
    author: str
    date: Optional[str]
    verified_purchase: bool = False

class ProductData(BaseModel):
    name: str
    price: Optional[float]
    currency: str = "USD"
    rating: Optional[float] = Field(None, ge=0, le=5)
    review_count: Optional[int]
    availability: Optional[str]
    features: List[str] = []
    top_reviews: List[ProductReview] = []

def extract_product_structured(url: str) -> ProductData:
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(r.text, 'html.parser')
    for tag in soup.find_all(['script', 'style']):
        tag.decompose()
    text = soup.get_text(separator='\n', strip=True)[:12000]

    # Use OpenAI's structured output feature
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract product information from this webpage. Include top 3 reviews if available."},
            {"role": "user", "content": text}
        ],
        response_format=ProductData
    )
    return response.choices[0].message.parsed

product = extract_product_structured("https://www.amazon.com/dp/...")
print(f"Product: {product.name}")
print(f"Price: ${product.price}")
print(f"Rating: {product.rating}/5 ({product.review_count} reviews)")
```
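If you can't take the Pydantic dependency, a minimal stdlib validator for the `{"field": "type-name"}` schema convention used throughout this article might look like the sketch below; `validate_extracted` is a hypothetical helper, not part of any library.

```python
# Minimal stdlib validator for the {"field": "type-name"} schema
# convention used in this article. A sketch -- Pydantic is the more
# robust option when you can take the dependency.
_CASTS = {"str": str, "float": float, "int": int, "bool": bool}

def validate_extracted(data: dict, schema: dict) -> dict:
    """Coerce extracted values to the declared types; missing fields become None."""
    out = {}
    for field, type_name in schema.items():
        value = data.get(field)
        if value is None:
            out[field] = None
        elif type_name == "bool" and isinstance(value, str):
            # Models sometimes return availability strings instead of booleans
            out[field] = value.strip().lower() in ("true", "yes", "in stock", "1")
        else:
            out[field] = _CASTS[type_name](value)
    return out

clean = validate_extracted(
    {"price": "49.99", "reviews_count": "123456"},
    {"price": "float", "reviews_count": "int", "brand": "str"},
)
print(clean)  # {'price': 49.99, 'reviews_count': 123456, 'brand': None}
```

This catches the most common failure mode (numbers returned as strings) without adding any dependencies.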
## Method 3: Claude for Long Documents
Claude handles larger contexts better — useful for long articles, reports, or multi-page PDFs:
```python
import anthropic
import requests
from bs4 import BeautifulSoup
import json

client = anthropic.Anthropic()  # Uses ANTHROPIC_API_KEY env var

def extract_with_claude(url: str, extraction_prompt: str) -> str:
    """
    Extract information from a webpage using Claude.
    Better for: long pages, nuanced extraction, natural language output.
    """
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(r.text, 'html.parser')
    for tag in soup.find_all(['script', 'style', 'noscript']):
        tag.decompose()

    # Claude can handle up to 200K tokens — much more headroom
    text = soup.get_text(separator='\n', strip=True)[:50000]

    message = client.messages.create(
        model="claude-3-5-haiku-latest",  # Cheapest Claude model
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"{extraction_prompt}\n\nWebpage content:\n{text}"
        }]
    )
    return message.content[0].text

# Example: Extract pricing table from a SaaS pricing page
pricing = extract_with_claude(
    "https://example-saas.com/pricing",
    """Extract the pricing information and return it as a JSON array.
Each plan should have: name, monthly_price, annual_price, features (list).
Return ONLY the JSON, no other text."""
)
print(json.loads(pricing))
```
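Note that `json.loads(pricing)` will raise if the model wraps its answer in a markdown fence or adds prose despite the instructions. A defensive parsing helper is cheap insurance; `parse_model_json` below is a hypothetical sketch, not part of the Anthropic or OpenAI SDKs.

````python
import json
import re

def parse_model_json(raw: str):
    """Parse JSON out of an LLM reply, tolerating markdown fences
    and surrounding prose. Hypothetical helper, not part of any SDK."""
    # Happy path: the reply is already pure JSON
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Strip a ```json ... ``` fence if present
    fenced = re.search(r'```(?:json)?\s*(.*?)\s*```', raw, re.DOTALL)
    if fenced:
        return json.loads(fenced.group(1))
    # Last resort: grab the first {...} or [...] span
    span = re.search(r'(\{.*\}|\[.*\])', raw, re.DOTALL)
    if span:
        return json.loads(span.group(1))
    raise ValueError("no JSON found in model output")

print(parse_model_json('Here you go:\n```json\n[{"name": "Pro", "monthly_price": 29}]\n```'))
# [{'name': 'Pro', 'monthly_price': 29}]
````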
## Method 4: Async Batch Extraction (Production Scale)
For scraping 100+ pages cost-effectively:
```python
import asyncio
import aiohttp
from openai import AsyncOpenAI
from bs4 import BeautifulSoup
import json
from typing import List, Dict

client = AsyncOpenAI()

async def fetch_page(session: aiohttp.ClientSession, url: str) -> str:
    """Fetch a page asynchronously and return its cleaned text."""
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"}
    async with session.get(url, headers=headers) as response:
        html = await response.text()
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all(['script', 'style']):
        tag.decompose()
    return soup.get_text(separator='\n', strip=True)[:8000]

async def extract_from_page(text: str, schema: dict) -> dict:
    """Extract structured data using GPT-4o-mini."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Return ONLY a JSON object matching: {json.dumps(schema)}"},
            {"role": "user", "content": text}
        ],
        temperature=0
    )
    try:
        return json.loads(response.choices[0].message.content)
    except json.JSONDecodeError:
        return {}

async def batch_extract(urls: List[str], schema: dict, concurrency: int = 5) -> List[Dict]:
    """Extract data from many URLs concurrently."""
    semaphore = asyncio.Semaphore(concurrency)

    async def process_url(session, url):
        async with semaphore:
            text = await fetch_page(session, url)
            result = await extract_from_page(text, schema)
            result['_url'] = url
            return result

    async with aiohttp.ClientSession() as session:
        tasks = [process_url(session, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    return [r for r in results if isinstance(r, dict)]

# Usage: extract from 100 product pages
urls = ["https://shop.example.com/product/1", "https://shop.example.com/product/2", ...]
schema = {"name": "str", "price": "float", "in_stock": "bool"}
results = asyncio.run(batch_extract(urls, schema, concurrency=5))
print(f"Extracted {len(results)} products")
# Cost for 100 pages at ~$0.0002/page = $0.02
```
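At batch scale you will also hit rate limits and transient network errors, so it is worth wrapping each call in a retry. One hedged pattern is exponential backoff with jitter; this generic sketch retries any coroutine factory and deliberately isn't tied to a specific SDK exception type.

```python
import asyncio
import random

async def with_retries(coro_factory, max_attempts: int = 4, base_delay: float = 1.0):
    """Retry an async call with exponential backoff plus jitter.
    `coro_factory` is a zero-argument callable returning a fresh coroutine."""
    for attempt in range(max_attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts -- surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)

# Self-check with a coroutine that fails twice, then succeeds
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(asyncio.run(with_retries(flaky, base_delay=0.01)))  # ok
```

In production you would narrow the `except` clause to the SDK's rate-limit and timeout exceptions rather than catching everything.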
## Hybrid Approach: Fast + Accurate
Use CSS selectors first, fall back to LLM only when they fail:
```python
import requests
from bs4 import BeautifulSoup
from openai import OpenAI
import json

client = OpenAI()

SELECTORS = {
    'amazon': {
        'name': '#productTitle',
        'price': '.a-price-whole',
        'rating': '.a-icon-alt',
    },
    'generic': None  # Fall back to LLM
}

def extract_price_hybrid(url: str) -> dict:
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(r.text, 'html.parser')

    # Try fast CSS selectors first
    if 'amazon.com' in url:
        selectors = SELECTORS['amazon']
        result = {}
        all_found = True
        for field, selector in selectors.items():
            elem = soup.select_one(selector)
            if elem:
                result[field] = elem.text.strip()
            else:
                all_found = False
                break
        if all_found:
            return {'method': 'css', 'data': result}

    # Fall back to LLM
    for tag in soup.find_all(['script', 'style']):
        tag.decompose()
    text = soup.get_text(separator='\n', strip=True)[:10000]

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": 'Return ONLY JSON: {"name": "str", "price": "float", "rating": "float"}'},
            {"role": "user", "content": text}
        ],
        temperature=0
    )
    try:
        return {'method': 'llm', 'data': json.loads(response.choices[0].message.content)}
    except json.JSONDecodeError:
        return {'method': 'failed', 'data': {}}

result = extract_price_hybrid("https://amazon.com/dp/...")
print(f"Method: {result['method']}, Data: {result['data']}")
```
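One more production lever: at `temperature=0` extraction results are stable enough to be worth caching, so you never pay for the same URL twice. Here is a sketch of a disk cache keyed by URL hash; `cached_extract` and the stub extractor are illustrative names, not from any library.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cached_extract(url: str, extractor, cache_dir: Path) -> dict:
    """Run `extractor(url)` once per URL, memoising the result on disk.
    `extractor` is any callable like extract_price_hybrid."""
    key = hashlib.sha256(url.encode()).hexdigest()[:16]
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())  # cache hit: no LLM call
    result = extractor(url)
    path.write_text(json.dumps(result))
    return result

# Demo with a stub extractor so this runs offline
cache = Path(tempfile.mkdtemp())
hits = {"n": 0}

def stub_extractor(url):
    hits["n"] += 1
    return {"method": "css", "data": {"price": "49.99"}}

first = cached_extract("https://example.com/p/1", stub_extractor, cache)
second = cached_extract("https://example.com/p/1", stub_extractor, cache)
print(hits["n"])  # 1 -- the second call was served from the cache
```

For pages that change over time (prices, stock), add a timestamp to the cached record and expire entries after a TTL that matches how fresh your data needs to be.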
## Cost Comparison by Volume
| Volume | CSS-only | GPT-4o-mini | GPT-4o | Claude Haiku |
|---|---|---|---|---|
| 100 pages | $0 | $0.02 | $0.50 | $0.03 |
| 1,000 pages | $0 | $0.20 | $5.00 | $0.30 |
| 10,000 pages | $0 | $2.00 | $50.00 | $3.00 |
| 100,000 pages | $0 | $20.00 | $500.00 | $30.00 |
Practical rule: Use CSS selectors for sites you control or know well. Use LLMs for:
- Unknown site structures (one-off extractions)
- Semantic extraction (summaries, classifications)
- Fallback when selectors break
## Related Articles
- Python Web Scraping Tutorial for Beginners 2026 — Traditional scraping foundation
- Web Scraping Tools Comparison 2026 — When to use which tool
- How to Validate LLM Outputs in Production — Validating extracted data
Want ready-to-use web scraping tools without the setup hassle?
The $29 Apify Scrapers Bundle includes 35+ production-ready scrapers — Google SERP, LinkedIn, Amazon, TikTok Shop, contact info, and more. One-time payment, instant download.
👉 Get the Bundle ($29)
## Related Tools
- ai-news-summarizer
- b2b-review-intelligence