Grok 4 is Here and it's Simply Brilliant! - Analytics Vidhya
Jul 12, 2025 am 09:14 AM“It’s smarter than almost all graduate students in all disciplines – Elon Musk.”
Elon Musk and his Grok team are back with their latest and best model to date: Grok 4. It was only 3 months ago that this team of experts launched Grok 3, a model that still competes with the giants from OpenAI, Gemini, and Anthropic. But with Grok 4, Elon Musk is giving these companies a run for their money. Grok 4 comes with superhuman-level thinking and reasoning capabilities. With tools and agents in its arsenal, it brings a better understanding of the world, both personal and professional. In this blog, we’ll explore everything about Grok 4: its features, capabilities, benchmarks, and finally, we’ll test it.
Let’s Grok it!
Table of contents
- What is Grok 4?
- Key Features
- Availability
- How to Access Grok 4?
- Grok 4 in Action
- Task 1: Solving a PhD-level Question
- Task 2: Performing a Multistep Research
- Task 3: Doing Coding with Context
- Grok 4 Benchmarks
- ARC-AGI
- Vending Bench
- Applications of Grok 4
- Grok 3 vs Grok 4
- Conclusion
What is Grok 4?
Grok 4 is the latest multi-modal large language model (LLM) from Elon Musk’s company, x.ai. It has 100 times more training data than Grok 2 (the first public model by x.ai) and 10 times more reinforcement learning compute than any other model available. Grok 4 features a 256K context window, real-time data search, advanced voice capabilities, agentic abilities, and intelligence that closely mimics human behavior.
Grok 4 has two versions:
- Normal Version: This is the single-agent version of the Grok 4 LLM. It features agentic behavior, where one agent works to solve your problems. This model is useful for daily tasks involving language, search, coding, and more. It’s available in the Super Grok plan offered by x.ai and also via API for developers.
- Grok 4 Heavy: This is the multi-agent version of Grok 4. When prompted, multiple agents collaborate, compare outcomes, and generate the best result. It’s ideal for complex reasoning, deep analysis, and research. It is available only under the Super Grok Heavy plan by x.ai.
Key Features
- It’s an Academic Whiz:?Grok 4 shines on the Humanity’s Last Exam (HLE) benchmark. Out of 2,500 questions spanning math, physics, chemistry, humanities, and computer science, it scored double digits on half! Most current models manage only low single digits, suggesting Grok 4 can tackle PhD-level problems across disciplines.
- Tool Use:?Grok 4 has been trained natively on tool use, outperforming Grok 3’s research tools. With extensive scaling and compute, it can handle even the toughest text-based problems.
- Its design is Agentic: The Grok 4 models are agentic. With single and multiple agents working behind the scenes, these models can swiftly perform multiple tasks.
- Its enhanced voice capabilities: The Grok 4 models come with an advanced voice mode that sounds more personal and calm compared to the other models from Open AI and Gemini. It comes with a new voice, “Eve” – a British speaker that can quickly switch from singing to whispering, mimicking human-like emotions. Along with this, the latency of their latest voice mode has been reduced by half, compared to its previous version.
- It can run a business: The Grok 4 models can reason out like humans and take decisive decisions, strategise, and plan in a way that makes them capable of running a business. Infact, they might just help you make some profit too.
When it comes to multimodal capabilities, especially image analysis and generation, Grok 4 models currently perform poorer than the top models like o3, Gemini 2.4 Pro, Claude 4, etc. Although this may improve significantly in the coming few days (or weeks).
Availability
- Super Grok:?Includes Grok 4 and Grok 3. Comes with a 128K token window, voice and vision capabilities. Priced at $30/month or $300/year.
- Super Grok Heavy:?Includes Grok 4 Heavy and Grok 4. Offers an enhanced context window and early access to new features. This premium plan costs $300/month or $3,000/year, comparable to OpenAI’s and Google’s premium tiers.
How to Access Grok 4?
To access Grok 4 on chat:
- Head to Grok.?
- Log in to your Super Grok account.
- In the chatbox in the middle of the screen and click on the small model dropdown at the corner of the chatbox.
- Select the “Grok 4” model
- Once done, you can get started.
To access Grok 4 on the API:
- Go to https://x.ai/api and click on API Console Login.
- Click on API Keys.
- Click on Create API key and after that give a name to your api key and click on Save to generate your grok api key.
- Now to access the Grok 4 using api endpoints, visit https://docs.x.ai/docs/models/grok-4-0709 and use the below code snippet to access it.
from xai_sdk import Client from xai_sdk.chat import user, system client = Client( api_host="api.x.ai", api_key="<your_xai_api_key_here>" ) chat = client.chat.create(model='grok-4-0709', temperature=0) chat.append(system("You are a PhD-level mathematician.")) chat.append(user("What is 2 2?")) response = chat.sample() print(response.content)</your_xai_api_key_here>
Grok 4 in Action
Now that we’ve read all about Grok 4, it’s time to see if it brings in the punch as it claims. To do this, we will test Grok 4 on the following tasks:
- PhD-level Question to test their reasoning capabilities
- Multi-step research to check its agentic capabilities
- Coding with context to test its real-world use capabilities
Let’s start.
Task 1: Solving a PhD-level Question
Result:
Analysis:
Grok 4 approached the problem step-by-step, addressing each question in order. It correctly interpreted the prompt, reasoned through the solution, and even generated code for the graphs when asked. The visualizations were accurate and aligned with the explanation.
Task 2: Performing a Multistep Research
Prompt: “Tell me about Analytics Vidhya’s latest post on X and find the latest blog on their website – summarise information on them in 5 lines each.”
Result:
Analysis:
This task it performed better than I had imagined. The task itself is not difficult, but I see so many models struggling with the dates to accurately fetch the latest information. Grok 4 took only a few seconds. It went through the website and the Twitter page, found the latest information, and then reasoned it out to give me 5 concrete lines on each.
You can check it yourself on our blog page or X page.?
Task 3: Doing Coding with Context
Prompt: “Merge all these PDFs and create a single JSON file.”
Files
Result:
Analysis:
It started well, by listing down the content from a few files, and then began the hallucinations. All that I got in the result was a stream of #. So this was disappointing.
Prompt 2: “Convert the following code into Python and React”
Code File
Result:
Analysis:
Grok 4 was quick and pretty efficient, it quickly generated the code in Python and actually understood that with the “react” word in my prompt. I was looking forward to seeing the code for my app’s frontend. It then also presented the code for each section, making it simple for me to copy the required part as and when it is needed.?
Grok 4 Benchmarks
Grok 4 almost aced all of the benchmarks that we usually look at. Here is a summary:
- GPQA (Graduate-Level Physics Questions Archive): This benchmark test expert expert-level science knowledge. On this benchmark, Grok 4 achieves 87-88%, leading competitors like GPT-4o and Claude 3.5 Sonnet.
- AIME (American Invitational Mathematics Examination) 2025: This benchmark compares the mathematical prowess. Grok 4 scores 95%, with some reports claiming up to 100% dominance. This surpasses previous SOTA models.
- SWE-Bench (Software Engineering Benchmark): It evaluates coding and real-world software problem-solving (Grok 4 Code variant). Scores range from 72-75%, significantly ahead of o3-mini (high) and Claude 3.5 Sonnet.
- Other Math and Reasoning Benchmarks: Grok 4 dominates U.S. Mathematical Olympiad and Harvard-MIT Mathematics Tournament, and similar tests with massive gains over prior SOTA. It also excels in general reasoning and Ph.D.-level tasks across fields.
These are the usual benchmarks for testing any latest LLM. Grok 4 also came with its scorecard on two new benchmarks: ARC-AGI and Vending Bench.
ARC-AGI
This benchmark checks how close models are to achieving AGI, or artificial general intelligence. This is done by scoring their performance on different ARC-style tasks, which are a collection of challenging puzzles.
Grok 4 takes up the top spot, breaking the 10% barrier, meaning the model has taken its first steps into general reasoning. Claude Opus 4 models follow next and then come o3 (high), o4-mini(high), and others! This seems that Grok 4 is essentially closer to AGI than the rest of its peers.?
Vending Bench
This benchmark tests the agentic AI systems to measure how well these agents can interact with a real e-commerce website to complete complex tasks. It’s designed to stress test real-world decision making, planning, and UI interaction.
Grok 4 excels in this too, beating some human, Claude 4, Opus, and Gemini 2.5 Pro and o3.
Infact, the Grok 4 was tested to run an actual vending machine to test this, and it incurred huge profits while doing so. Anthropic had released something similar about Claude running a vending machine a few days back, and in that, they had mentioned that the machine ran into a loss!
Applications of Grok 4
Grok 4 comes with a great set of features and performance benchmarks, based on which it can be pretty useful for:
- Real-Time Social Media Interaction: It is integrated directly into X (formerly Twitter) as a chatbot. It can be used to generate memes, posts, polls, summaries, or sentiment analysis.
- Advanced Research: It can solve PhD-level questions, thus indicating that it can truly contribute to advanced research in mathematics, physics, and engineering.
- Business Planning: It can help to map out strategies and perform advanced business analysis to help you get actionable insights.
- Coding and Writing: Grok 4 comes with brilliant SWE benchmarks and agentic capabilities, thus it can take up many coding tasks and perform them well too.
Grok 3 vs Grok 4
Although Grok 3 has been in the spotlight for its racist comments, with Grok 4, the team is looking to do more than just damage control. Grok 4 comes with tool use integrated from the start, and the Grok team plans to upgrade this to “commercial grade” capabilities, helping you solve actual, real-world problems. Along with this, we can expect Grok 4 to master video and image analysis and generation very soon, bringing us closer to experiencing playable AI-generated video games and fully AI-generated shows.
Conclusion
Is Grok 4 a big deal? Definitely. In a market that feels increasingly saturated, it stands out as a breath of fresh air, offering real improvements over its predecessors. With actual use cases emerging, it seems poised to help solve many everyday problems. Both standard and Heavy variants are agentic, fast, and significantly better at reasoning. While some suggest it’s built for AGI, I believe there’s still time and room for growth. Grok 3 also launched with great promise but later went off track. With this new release, it’s just the beginning, much testing is still needed to understand its true potential.
The above is the detailed content of Grok 4 is Here and it's Simply Brilliant! - Analytics Vidhya. For more information, please follow other related articles on the PHP Chinese website!

Hot AI Tools

Undress AI Tool
Undress images for free

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics

Google’s NotebookLM is a smart AI note-taking tool powered by Gemini 2.5, which excels at summarizing documents. However, it still has limitations in tool use, like source caps, cloud dependence, and the recent “Discover” feature

But what’s at stake here isn’t just retroactive damages or royalty reimbursements. According to Yelena Ambartsumian, an AI governance and IP lawyer and founder of Ambart Law PLLC, the real concern is forward-looking.“I think Disney and Universal’s ma

Here are ten compelling trends reshaping the enterprise AI landscape.Rising Financial Commitment to LLMsOrganizations are significantly increasing their investments in LLMs, with 72% expecting their spending to rise this year. Currently, nearly 40% a

Using AI is not the same as using it well. Many founders have discovered this through experience. What begins as a time-saving experiment often ends up creating more work. Teams end up spending hours revising AI-generated content or verifying outputs

Space company Voyager Technologies raised close to $383 million during its IPO on Wednesday, with shares offered at $31. The firm provides a range of space-related services to both government and commercial clients, including activities aboard the In

I have, of course, been closely following Boston Dynamics, which is located nearby. However, on the global stage, another robotics company is rising as a formidable presence. Their four-legged robots are already being deployed in the real world, and

Add to this reality the fact that AI largely remains a black box and engineers still struggle to explain why models behave unpredictably or how to fix them, and you might start to grasp the major challenge facing the industry today.But that’s where a

Nvidia has rebranded Lepton AI as DGX Cloud Lepton and reintroduced it in June 2025. As stated by Nvidia, the service offers a unified AI platform and compute marketplace that links developers to tens of thousands of GPUs from a global network of clo
