Andrew Ng Announces Agentic Document Extraction, Google’s Transformer 2.0 Fixes Memory, and GPT 4.5 versus Sonnet 3.7

Andrew Ng announces agentic document extraction on Twitter. Google's Transformer 2.0 shows attention isn't all you need. GPT 4.5 tries to outsmart Sonnet 3.7 in a game.

Jose Nicholas Francisco
March 04, 2025

Welcome (back) to AI Minds, a newsletter about the brainy and sometimes zany world of AI, brought to you by the Deepgram editorial team.

In this edition:

🎥 Attention isn’t all you need: Google’s “Transformer 2.0” and AI Memory Recall
🧠 Researchers use deep learning to detect mental illness
🦊 Stereotyping animals with vision-language models
💻 Deepgram & Vonage Technical Webinar: How to build responsive voice agents
🛣️ Meet Deepgram at HumanX & NVIDIA GTC!
📲 Three new, trending AI apps for you!
📄 Andrew Ng announces Agentic Document Extraction
🐦 Social Media Buzz: Best code embedding model in the market
🚁 Drone uses machine learning to track its subjects with a camera
🎙️ AI Minds Podcast with Pablo Palafox, Co-Founder and CEO at HappyRobot
🤖 Bonus Video - Two AI agents play game OVER SOUND: Sonnet 3.7 vs GPT 4.5

Thanks for letting us crash your inbox; let’s party. 🎉

We coded with the brand-new Whisper-v3 over the past week, and the results were not what we expected. Check it out here!

🎥 Attention isn’t all you need: Google’s “Transformer 2.0” and AI Memory Recall

Video description: “In this video, [Bycloud] will be sharing the research that aims to solve the problem of context window, kv-cache, and memory recall efficiency. Even though the title only mentions Google's research, [Bycloud] also included research from Meta and Sakana AI. They paved a good way to introduce the idea of AI memory.”

Papers mentioned in the video:

🔍 Detecting Mental Illness with Deep Learning and Stereotyping Animals with vision-language AI

Tutorial on Using Machine Learning and Deep Learning Models for Mental Illness Detection - This tutorial provides practical guidance to address common challenges in applying machine learning and deep learning methods for mental health detection on platforms like social media. It focuses on strategies for working with diverse datasets, improving text preprocessing, and addressing issues such as imbalanced data and model evaluation.
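To make the imbalanced-data and evaluation point concrete, here is a minimal sketch in plain Python. It is not taken from the tutorial: the labels, the toy predictions, and the helper names `class_weights` and `macro_f1` are assumptions made up for illustration. It shows why accuracy alone can mislead when one class is rare, and how inverse-frequency weights (the same heuristic scikit-learn uses for `class_weight="balanced"`) can up-weight the rare class.

```python
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: 9 "control" posts and 1 "depression" post.
y_true = ["control"] * 9 + ["depression"]
# A lazy classifier that always predicts the majority class.
y_pred = ["control"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
weights = class_weights(y_true)
score = macro_f1(y_true, y_pred)

print(f"accuracy = {accuracy:.2f}")   # looks strong
print(f"macro F1 = {score:.2f}")      # exposes the missed minority class
print(f"weights  = {weights}")
```

On this toy data the majority-class classifier scores 90% accuracy while never finding the one positive example, which is why a per-class metric such as macro F1 is usually reported alongside accuracy for this kind of task.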

Owls are wise and foxes are unfaithful: Uncovering animal stereotypes in vision-language models - This study investigates how animal stereotypes manifest in vision-language models during image generation. Through targeted prompts, the authors explore whether DALL-E perpetuates stereotypical representations of animals, such as owls being wise and foxes being unfaithful.

⚡ Technical Deep Dive: How to Build Responsive Voice Agents with Vonage & Deepgram

Learn how to build human-like voice agents for customer support, appointment scheduling and more in our March 26th technical webinar with Vonage.

When: Wednesday 26th March 2025, 10:00 PT / 13:00 ET / 17:00 GMT

Where: Online

⭐️ Save your spot here! ⭐️

Hosted by:

Benjamin Aronov, Developer Advocate at Vonage
Tony Chan, Senior Solutions Engineer at Vonage
Damien Murphy, Applied Engineer at Deepgram

🔊 Deepgram is Hitting the Road: HumanX & NVIDIA GTC

We’re gearing up for HumanX & NVIDIA GTC — two of the biggest AI events of the year. If you’re building with voice AI, stop by to see how our APIs can power real-time, scalable speech applications with low latency and high accuracy.

📍Find us here:

🚀 HumanX – Booth 825
🚀 NVIDIA GTC – Booth 1709

Let’s meet onsite—grab time with our team!

The best *code embedding* model in the market right now was just released:
Qodo-Embed-1 — There are two flavors: A lite model with 1.5B parameters and a medium model with 7B parameters (Hugging Face links below).
If you want to index a large codebase (supports 10M+ lines of… x.com/i/web/status/1…
— Santiago (@svpino)
1:38 PM • Mar 3, 2025

Announcing: Agentic Document Extraction!
PDF files represent information visually - via layout, charts, graphs, etc. - and are more than just text. Unlike traditional OCR and most PDF-to-text approaches, which focus on extracting the text, an agentic approach lets us break a… x.com/i/web/status/1…
— Andrew Ng (@AndrewYNg)
6:47 PM • Feb 27, 2025

This drone uses machine learning on the device to track me and fly all by itself, all while avoiding trees and objects! Incredible.
— Marc Grabanski (@1Marc)
4:40 PM • Oct 9, 2021

Snapvid AI helps save time in the video editing process by adding subtitles and emojis in seconds. Additionally, you can insert video footage, transitions, and sound effects with just one click.

Lenso is a cutting-edge application designed to enhance productivity and streamline workflows. By leveraging advanced technology, Lenso offers users a seamless experience that caters to a wide array of needs. Whether you’re a professional looking to optimize your daily tasks or a team in need of effective collaboration tools, Lenso provides a comprehensive solution.

Mix Check Studio, powered by RoEx, is a cutting-edge platform designed to provide precise feedback and enhancement for audio mixes and masters. Utilizing advanced AI technology, Mix Check Studio aims to streamline the audio production process by offering users the ability to upload their tracks and receive detailed analysis and improvements. This tool is especially beneficial for musicians, producers, and audio engineers.

🎤 The AI Minds Podcast

This episode of the AI Minds Podcast features Pablo Palafox, Co-Founder and CEO at HappyRobot. HappyRobot automates communication across channels with AI workers that integrate with your systems, manage conversations, and log data.

He emphasizes the customer-centric approach they’ve taken, continuously refining their platform based on feedback from the logistics sector to ensure real-world value and address genuine business pain points.

🤖 Bonus Video: Two AI agents play game OVER SOUND: Sonnet 3.7 vs GPT 4.5

Video Description: “All the communication is happening exclusively by sound, using Gibberlink (powered by ggwave protocol). The bots were not programmed to play this game. One of the bots simply had an objective "play and win tic tac toe" in system prompt.”