Grok - AI Language Model Assistant Released by Elon Musk's xAI

Grok is a large artificial-intelligence model launched by Elon Musk's xAI on November 5, 2023. Modeled on "The Hitchhiker's Guide to the Galaxy," it is designed to answer almost any question, including the spicy questions that most other AI systems refuse, and it can even suggest which questions to ask. It was trained to be somewhat rebellious and to answer with humor, reportedly reflecting Elon's vision of a less "woke" alternative to existing chatbots. Grok's pre-training data runs through the third quarter of 2023, and it has access to search tools and real-time information from 𝕏 (formerly Twitter), a significant advantage since 𝕏 carries the latest news as it breaks.

Across benchmark tests, Grok-1 showed strong results, surpassing every other model in its compute class. To keep the evaluation objective, the model was also hand-graded on the Hungarian national high school mathematics final administered in late May 2023, after the training data had been collected. Grok passed with a C (59%), Claude 2 earned a similar C (55%), and GPT-4 achieved a B (68%).

See It in Action

Elon has indicated that Grok will be accessible to all X Premium+ subscribers once its beta phase is complete. For now, access is limited to a handful of users, but you can sign up for the waiting list as follows:

Step one: open the official website, click the "Sign in with X" button, and log in with your 𝕏 (Twitter) account.


Step two: click the "Authorize app" button to grant Grok access to your 𝕏 (Twitter) account.


Step three: enter your email address and wait for the trial-experience notification.


Features

Grok also brings several innovations to how you interact with it, such as running multiple conversations concurrently and branching a dialogue at any point.
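xAI has not documented how these interaction features are implemented. Purely as an illustration, dialogue branching can be modeled as a message tree in which any reply may fork a new branch; every name below is hypothetical:

```python
# Hypothetical sketch of dialogue branching: each message node may have
# several children, so a user can fork a conversation at any point.
from dataclasses import dataclass, field

@dataclass
class Message:
    role: str                                   # "user" or "assistant"
    text: str
    children: list = field(default_factory=list)

    def reply(self, role, text):
        """Append a reply, creating a new branch under this message."""
        child = Message(role, text)
        self.children.append(child)
        return child

def branch_count(node):
    """Number of distinct leaf conversations reachable from this node."""
    if not node.children:
        return 1
    return sum(branch_count(c) for c in node.children)

root = Message("user", "Tell me about xAI.")
answer = root.reply("assistant", "xAI is Elon Musk's AI company.")
# Fork the conversation: two alternative follow-ups to the same answer.
answer.reply("user", "When was it founded?")
answer.reply("user", "What models has it released?")
print(branch_count(root))  # 2
```

Each fork shares the history above it, which is what lets a branching UI revisit an earlier answer without discarding the other thread.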

Performance Comparison

Our team has been diligently working on our leading-edge language model, Grok-1, for the past four months, and it has evolved through numerous versions during this period.

The journey began with Grok-0, a prototype language model with 33 billion parameters, which we introduced after announcing xAI.
Grok-0 came close to matching the performance of LLaMA 2, which has 70 billion parameters, on conventional language model benchmarks, despite only requiring half the amount of training resources.
In the two months that followed, we made substantial strides, notably in reasoning and programming tasks.
This progress has led to the inception of Grok-1, our cutting-edge language model that markedly outperforms its predecessors, achieving a score of 63.2% on the HumanEval coding benchmark and 73% on the MMLU.

In order to precisely quantify the enhancements in Grok-1, we've employed several standard machine learning benchmarks that are designed to test mathematical and reasoning capabilities.

GSM8k: middle school math word problems (Cobbe et al., 2021), approached with a chain-of-thought prompting strategy.

MMLU: multidisciplinary multiple-choice questions (Hendrycks et al., 2021), with Grok-1 given five in-context examples per problem.

HumanEval: a Python code-completion challenge (Chen et al., 2021), in which Grok-1 was evaluated zero-shot, reporting the pass@1 rate.

MATH: middle school and high school math problems (Hendrycks et al., 2021) written in LaTeX, prompted to Grok-1 with a fixed four-shot prompt.
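For reference, HumanEval's pass@1 number comes from the pass@k estimator defined by Chen et al. (2021): given n sampled completions of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). A small sketch of that estimator:

```python
# Unbiased pass@k estimator from Chen et al., 2021 (the HumanEval paper):
# pass@k = 1 - C(n - c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = total samples, c = samples that passed the tests, k = budget."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the fraction of passing samples:
print(pass_at_k(10, 3, 1))  # 0.3, up to floating-point error
```

With k = 1 the formula collapses to c / n, which is why a zero-shot pass@1 score is simply the share of generated programs that pass their tests.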

| Benchmark | Grok-0 (33B) | LLaMA 2 70B | Inflection-1 | GPT-3.5 | Grok-1 | PaLM 2 | Claude 2 | GPT-4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GSM8k (8-shot) | 56.8% | 56.8% | 62.9% | 57.1% | 62.9% | 80.7% | 88.0% | 92.0% |
| MMLU (5-shot) | 65.7% | 68.9% | 72.7% | 70.0% | 73.0% | 78.0% | 75.0%* | 86.4% |
| HumanEval (0-shot) | 39.7% | 29.9% | 35.4% | 48.1% | 63.2% | - | 70% | 67% |
| MATH (4-shot) | 15.7% | 13.5% | 16.0% | 23.5% | 23.9% | 34.6% | - | 42.5% |

*Claude 2's MMLU score was reported with 5-shot prompting plus chain-of-thought (CoT); "-" marks an unreported result.

In the referenced performance evaluations, Grok-1 emerged as a top contender, outshining all its contemporaries within its computational category, including notable entities like ChatGPT-3.5 and Inflection-1. Its performance was exceeded only by architectures with substantially greater training datasets and computational power, such as GPT-4, illustrating xAI's strides in developing highly efficient Large Language Models (LLMs).

Because these benchmarks are freely available online, our models may have been exposed to them during training. For a more stringent assessment, we hand-graded our model, alongside Claude 2 and GPT-4, on the mathematics section of the 2023 Hungarian national high school finals, which was published in late May, after our data collection cutoff. Grok earned a 'C' (59%), Claude 2 also earned a 'C' (55%), and GPT-4 secured a 'B' (68%). All models were evaluated at a temperature setting of 0.1 with identical prompts, and no optimization was performed for this particular test; it served as an empirical check on fresh material that was not part of any model's training data.
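For context on the temperature setting: sampling temperature rescales the model's logits before the softmax, and at 0.1 the resulting distribution is sharply peaked, making outputs nearly deterministic and the comparison reproducible. A minimal sketch in plain Python (not any vendor's actual code):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 1.0))  # probability mass is spread out
print(softmax_with_temperature(logits, 0.1))  # nearly all mass on the top token
```

Lower temperature therefore trades diversity for consistency, which is what you want when grading several models on the same exam.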

| Human-graded evaluation | Grok-0 | GPT-3.5 | Claude 2 | Grok-1 | GPT-4 |
| --- | --- | --- | --- | --- | --- |
| Hungarian National High School Math Exam (May 2023), 1-shot | 37% | 41% | 55% | 59% | 68% |

Grok-1 model card

| Section | Details |
| --- | --- |
| Model details | Grok-1 is an autoregressive Transformer-based model pre-trained to perform next-token prediction, then fine-tuned with extensive feedback from both humans and the early Grok-0 models. The initial Grok-1 has a context length of 8,192 tokens and was released in November 2023. |
| Intended uses | Grok-1 is intended to be the engine behind Grok for natural-language processing tasks including question answering, information retrieval, creative writing, and coding assistance. |
| Limitations | While Grok-1 excels at information processing, humans should review its output for accuracy. The model cannot search the web on its own; search tools and databases enhance its capabilities and factuality when deployed in Grok, but it can still hallucinate even with access to external information sources. |
| Training data | The training data for the release version of Grok-1 comes from the Internet up to Q3 2023 and from data provided by our AI Tutors. |
| Evaluation | Grok-1 was evaluated on a range of reasoning benchmark tasks and on curated foreign mathematics exam questions. We engaged early alpha testers, including for adversarial testing, and are expanding early access toward a closed beta. |

Advanced Technology

In the vanguard of deep learning innovation, the development of steadfast infrastructure is as crucial as the creation of datasets and the development of algorithms. For the construction of Grok, we crafted a bespoke stack for training and inference that leverages Kubernetes, Rust, and JAX.

Training an LLM is like a freight train barreling ahead: if a single car derails, the entire train can be dragged off the tracks, and getting it back on is a serious challenge. GPUs fail in a plethora of ways: factory flaws, unstable connections, improper configuration, degraded memory modules, even sporadic bit flips.

In our training runs we orchestrate synchronized operations across tens of thousands of GPUs for prolonged durations, and at that scale these malfunctions occur with alarming regularity. To combat them, we have designed and implemented specialized distributed systems that swiftly pinpoint and autonomously recover from each category of failure.

At xAI, maximizing useful compute per watt has been a top priority. In recent months our infrastructure advancements have substantially curtailed downtime and preserved a high Model Flop Utilization (MFU) despite unreliable hardware.
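xAI has not published these systems, so the following is only a loose sketch of the auto-recovery idea: detect a transient fault before the update is applied and rerun the step. Every name here is hypothetical:

```python
# Hypothetical sketch of fault-tolerant training: a transient hardware fault
# is caught before the update commits and the step is automatically retried.
# Illustrative only; this is not xAI's actual tooling.
import random

class TransientGPUFault(RuntimeError):
    """Stand-in for an ECC error, a flaky interconnect, or a bit flip."""

def train_step(step, state):
    if random.random() < 0.05:        # inject a sporadic failure
        raise TransientGPUFault(f"fault at step {step}")
    return state + 1                  # stand-in for one gradient update

def train(num_steps, max_retries=10):
    state = 0
    for step in range(num_steps):
        for _ in range(max_retries):
            try:
                state = train_step(step, state)
                break                 # step succeeded, move on
            except TransientGPUFault:
                continue              # auto-recover: rerun the step
        else:
            raise RuntimeError(f"step {step} failed {max_retries} times")
    return state

random.seed(0)
print(train(100))  # completes all 100 steps despite injected faults
```

Real systems additionally checkpoint model state and reschedule work onto healthy hardware, but the core loop (detect, recover, continue) is the same shape.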

Rust has proven an excellent choice for building scalable, dependable, and maintainable infrastructure. It delivers stellar performance, a rich ecosystem, and protection against the classes of bugs typically encountered in distributed systems. With a lean team, infrastructure reliability is pivotal; without it, ongoing maintenance would eclipse innovation. Rust gives us confidence that any modification or overhaul of our code is likely to yield working, long-running programs that demand minimal oversight.

As we gear up for an upcoming surge in model capabilities, our focus shifts to orchestrating training across an even broader network of accelerators, managing data pipelines on a global scale, and integrating unprecedented functionalities and instruments within Grok.

Grok AI’s vision

At xAI, we want to create AI tools that assist humanity in its quest for understanding and knowledge.

By creating and improving Grok, we aim for our AI tools to assist in the pursuit of understanding.