📰 AI News Daily — 26 Dec 2025
— TL;DR (Top 5 Highlights) —
- Chip race heats up: Samsung taps Biren’s founder; reports point to an Nvidia–Groq talent/tech deal.
- Open‑weights momentum: GLM 4.7 tops open models; smaller RL‑tuned models set new marks.
- Evaluation gets easier: Anthropic’s Bloom and modular “Agent Skills” push reliability and interoperability.
- Research watch: PoPE fix for RoPE, traceable reasoning checks, and new human–LLM judgment gaps.
- Monetization and safety: OpenAI weighs ads as usage surges; deepfake detection and crackdowns intensify; a 300TB music archive raises copyright alarms.
🛠️ New Tools
- Anthropic Bloom (open‑source) automates behavioral test creation and scoring, dramatically cutting manual eval work and helping teams track honesty, robustness, and safety traits before models reach production users.
- Anthropic Agent Skills debuts an open standard for modular agent capabilities, improving portability and reuse; OpenAI is trialing a similar “Skills” framework, signaling a shift from model silos to agent ecosystems.
- A new agent analysis tool surfaces failure modes during development, helping teams catch brittle plans, unsafe actions, and tool‑use errors before they become costly user‑facing bugs.
- vLLM adds FunctionGemma support with a custom parser, enabling smoother token streaming and more reliable function/tool calling for production‑grade agent workflows.
- Kling 2.6 Motion Control boosts cinematic, full‑body movement and natural expressions; early showdowns rate it highly, and a community challenge incentivizes experimentation and best‑practice sharing.
- Qwen Image Edit arrives in ComfyUI with layered edits, giving creators finer control for multi‑step visual changes without destructive workflows.
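The vLLM tool‑calling item above relies on the OpenAI‑style function schema that vLLM's OpenAI‑compatible server accepts. A minimal sketch of such a request body follows; the tool definition, model id, and message are illustrative placeholders, not part of any announced release.

```python
import json

# OpenAI-style function/tool schema: the format an OpenAI-compatible
# endpoint (such as vLLM's) accepts for tool calling.
# The tool name and its fields here are purely illustrative.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

# A chat-completions request body carrying the tool definition.
# The model id is a placeholder; the real name depends on the deployment.
request_body = {
    "model": "functiongemma-placeholder",
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": [get_weather_tool],
    "tool_choice": "auto",
}

print(json.dumps(request_body, indent=2))
```

In a live setup this body would be POSTed to the server's `/v1/chat/completions` route, and the custom parser mentioned above is what turns the model's streamed tokens into well‑formed tool calls.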
🤖 LLM Updates
- GLM 4.7 climbs to #2 on Website and Design Arena leaderboards—leading all open‑weights and trailing only the latest Gemini—signaling rapid gains in practical reasoning and UX.
- LFM2‑2.6B‑Exp (RL‑trained) sets a new bar for ~3B models, outperforming larger peers on instruction following, knowledge recall, and math—evidence that technique can trump sheer parameter count.
- MiniMax M2.1 earns praise for multilingual coding and reasoning, offering strong small‑model performance that lowers inference costs without sacrificing utility.
- Trained A3B and 8B weights land on Hugging Face, expanding practical access for teams shipping lightweight assistants and on‑device experiences.
- The Poetiq system, using a reported GPT‑5.2 X‑High, claims up to 75% on ARC‑AGI‑2 at low cost—about a 15‑point jump—highlighting efficient reasoning pipelines.
- GLM 4.7 integrates with FactoryAI’s Droid, pointing to richer orchestration and deployment options for automation and enterprise agent stacks.
📑 Research & Papers
- A preprint identifies seven core gaps between human and LLM judgment—covering nuance, uncertainty, and social reasoning—guiding evaluation design and risk assessment for high‑stakes deployments.
- OpenAI proposes pre‑action “reasoning traceability” checks, assessing how well a model’s thought process can be inspected before it acts—useful for safety‑critical tools and regulated workflows.
- Multiple papers flag a RoPE positional‑encoding flaw; a simple PoPE tweak improves long‑context stability and recall, offering an easy gain for existing architectures.
- Researchers use LLMs to generate and refine systems algorithms, speeding prototyping and enabling broader exploration—suggesting a new co‑design loop for compilers, schedulers, and protocols.
- A Stanford–Harvard study finds impressive “agentic AI” demos often underperform in the wild, urging more context‑aware, realistic testing before broad deployment.
- At the Marine Biological Laboratory, AI‑driven visualization unlocks analysis of massive brain datasets, deepening insight into how long‑term memories form and are stored.
🏢 Industry & Policy
- Samsung hires Biren’s founder to spearhead a next‑gen GPU push, underscoring intensifying competition across training and inference silicon.
- Reports suggest Nvidia is moving on Groq via strategic licensing and key acqui‑hires, signaling a bid to consolidate strengths in ultra‑fast inference as demand soars.
- Investors including Oracle, CoreWeave, and SoftBank lean in while chip startups (Cerebras, Etched) report rising valuations and clearer exit paths—fueling a robust hardware funding cycle.
- OpenAI explores ads in ChatGPT (e.g., sponsored suggestions) to diversify revenue as adoption surges—57% of Americans used a chatbot last week—yet paid subscriptions remain under 10%.
- A public 300TB music archive sparks fresh scrutiny of training data and copyright, raising questions for dataset governance, licensing norms, and fair‑use boundaries.
- Google expands SynthID watermark checks for AI‑generated media, while US/UK lawmakers and platforms escalate deepfake enforcement—an arms race to curb abuse and protect victims.
📚 Tutorials & Guides
- The eggroll community releases a beginner‑friendly Colab notebook, lowering the barrier to explore its AI codebase with hands‑on walkthroughs, runnable examples, and clear pointers for first contributions.
🎬 Showcases & Demos
- A 24‑hour run of Claude Code autonomously built 500 projects, produced ~450,000 lines of code, and captured 1,500+ screenshots—showcasing rapid progress in long‑ and short‑term memory for coding agents.
- Startup Enso executed a full ad campaign in six hours for $150, demonstrating how AI tooling compresses creative cycles from weeks to hours while preserving brand direction and iteration speed.
💡 Discussions & Ideas
- Leaders urge substance over hype: listening to users beats flashy demos, and ignoring feedback risks community trust—especially for early‑stage startups.
- 2025 picks for research and coding favor Gemini 3 and Opus 4.5, reflecting a split between frontier reasoning and dependable developer ergonomics.
- Tooling debate: users like both AmpCode and FactoryAI; AmpCode’s threads, handoffs, and sub‑agents win points for collaboration and task decomposition.
- The AGI conversation shifts toward open‑endedness—discovery, creativity, and continual learning—rather than narrow benchmark wins.
- Insights from the AIE World’s Fair: today’s AI still yields modest productivity gains for seasoned developers, highlighting integration costs and the value of better workflows.
Source Credits
Curated from 250+ RSS feeds, Twitter expert lists, Reddit, and Hacker News.