# Groq AI Workshop

ALCF AI Testbed



# Agenda - Day 1

| Session                                             | Description                                                                                                                                            | Length  | Speaker                                      |
|-----------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|---------|----------------------------------------------|
| Intro to ALCF                                       | Introduction to the Argonne Leadership Computing Facility AI Testbed.                                                                                  | 5 mins  | ALCF Staff                                   |
| Welcome to Groq                                     | Introduction to the AI/ML space, who we are, and applications that can leverage Groq for inference.                                                    | 5 mins  | Jonathan Ross, CEO & Founder                 |
| Groq Tensor Streaming<br>Processor™<br>Architecture | Deep dive on the Groq Language Processing Unit™<br>(LPU) tensor streaming architecture, including in-depth<br>explanations on each module of the chip. | 45 mins | Andrew Bitar, Sr. Staff Compiler<br>Engineer |
| Intro to MLAgility™ and<br>GroqFlow™                | Introduction to the MLAgility Project's HuggingFace<br>Space and the GroqFlow toolchain used to port models.                                           | 15 mins | Sanjif Shanmugavelu, Software<br>Engineer    |
| Porting Models with<br>GroqFlow™                    | Step-by-step walkthrough of model porting with<br>GroqFlow for execution on GroqRack (including best<br>practices).                                    | 45 mins | Sanjif Shanmugavelu, Software<br>Engineer    |
| Benchmarking Models<br>with MLAgility™              | How to benchmark multiple models with MLAgility.                                                                                                       | 45 mins | Sanjif Shanmugavelu, Software<br>Engineer    |
| Accessing GroqRack™<br>at ALCF AI Testbed           | How to access GroqRack.                                                                                                                                | 20 mins | ALCF Staff                                   |

# Welcome to Groq

**Jonathan Ross** Founder & CEO



# Groq Tensor Streaming Processor™ Architecture

**Andrew Bitar** Sr. Staff Compiler Engineer

© 2023 Groq, Inc. | Groq AI Workshop

### Groq Tensor Streaming Architecture

#### AGENDA

1. Architecture Overview
2. Key Functional Units
3. Scaling to 1000s of GroqChip™ Processors





performance, than a GPU for AI applications like LLMs.



### Groq Simplifies Compute





#### Graphics Processor (GPU): COMPLEX

- Non-deterministic execution
- Difficult to program
- Higher latency
- Higher costs

#### Tensor Streaming Processor (TSP): SIMPLIFIED

- Deterministic / predictable execution
- Easier compilation
- Lower latency
- Higher efficiency at scale



Presented at Crossroads 3D-FPGA Academic Research Center - December 2022

### The Missing Middle

#### Algorithms

- Dataflow dominated
- Statically predictable set of executed operations
- Highly parallel vector operations

#### Compilers

- Remain a challenge
- Reliant on hand-tuned libraries
- Fragmented front-end ecosystem
- Require iterative hardware profiling

#### Hardware

- High-density compute using SIMD
- Less silicon area spent on re-ordering and speculation
- More memory bandwidth

#### ✗ UNPREDICTABLE

#### ✓ PREDICTABLE


Predictable Compute Needs Predictable Hardware.

### GroqChip™ Overview





Build different types of specialized SIMD units



Lay out SIMD units across chip area



Synchronized instruction dispatch across all SIMD units for lockstep execution



High-bandwidth "Stream Registers" for passing data between units



144 Instruction Dispatch Paths

### Empowering Groq<sup>™</sup> Compiler

### Architecture Empowering Software

#### Software-controlled memory

- No dynamic hardware caching
  - Compiler aware of all data locations at any given point in time
- Flat memory hierarchy (no L1, L2, L3, etc.)
  - Memory exposed to software as a set of physical banks that are directly addressed
- Large on-chip memory capacity (220 MiB) at very high bandwidth (80 TBps)
  - Achieves high compute efficiency even at low operational intensity
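The link between memory bandwidth and efficiency at low operational intensity can be illustrated with a roofline-style bound. This is only a sketch: the 80 TBps figure comes from the slide, while the peak compute value and the function name are hypothetical placeholders chosen to make the arithmetic concrete.

```python
# Roofline-style estimate: attainable throughput is capped either by peak
# compute or by memory bandwidth multiplied by operational intensity.

SRAM_BANDWIDTH_BYTES_PER_S = 80e12  # 80 TBps, the figure quoted on the slide

# HYPOTHETICAL peak compute rate, used only to make the example concrete.
ASSUMED_PEAK_OPS_PER_S = 500e12


def attainable_ops_per_s(intensity_ops_per_byte, peak_ops_per_s,
                         bandwidth_bytes_per_s=SRAM_BANDWIDTH_BYTES_PER_S):
    """Return the roofline bound for a kernel with the given intensity."""
    memory_bound = intensity_ops_per_byte * bandwidth_bytes_per_s
    return min(peak_ops_per_s, memory_bound)


for intensity in (1, 4, 16):
    bound = attainable_ops_per_s(intensity, ASSUMED_PEAK_OPS_PER_S)
    print(f"{intensity:>2} ops/byte -> {bound / 1e12:.0f} Tops/s attainable")
```

The higher the bandwidth, the lower the intensity at which a kernel stops being memory bound, which is the point the slide makes.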



### Architecture Empowering Software

#### Lockstep execution of Functional Units

Compiler empowered to perform cycle-accurate instruction scheduling

- Synchronous "threads"
- One instruction issued per cycle at each dispatch path
- Little hardware control needed for managing instruction execution
- < 3% area overhead for instruction dispatch logic



### Architecture Empowering Software

#### Simple, one-dimensional interconnect for inter-FU communication

Compiler can quickly reason about all data movement between FUs

- Eastward and westward paths made up of arrays of "stream registers"
- Stream register = one-cycle hop

No arbiters / queues = software can easily reason about exact data movement without simulation

Travel time calculation as simple as a single add/subtract
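As a rough illustration of why travel time reduces to an add or subtract, the sketch below models the interconnect as integer positions along a line, with one stream register hop per cycle. The position numbering and function names are assumptions made for this example, not part of the Groq toolchain.

```python
def hops_between(src_position, dst_position):
    """Number of stream-register hops between two positions on the 1D path."""
    return abs(dst_position - src_position)


def arrival_cycle(issue_cycle, src_position, dst_position):
    """Cycle at which data issued at src reaches dst, at one hop per cycle."""
    return issue_cycle + hops_between(src_position, dst_position)


def travel_direction(src_position, dst_position):
    """Assumes positions increase toward the east."""
    return "eastward" if dst_position > src_position else "westward"


# Data issued at cycle 100 from position 3 toward position 10.
print(arrival_cycle(100, 3, 10), travel_direction(3, 10))
```

With no arbiters or queues in the path, nothing else has to be modeled to know when the data arrives.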



Stream Register = 1 hop

### Power of Data Orchestration

#### Given to Groq Compiler



### GroqChip<sup>™</sup> Functional Units


### Tensor Streaming Dataflow



Deterministic, predictable performance scales to multi-chip

Spatial pipeline processing

Simple tensor instruction set architecture

Stream programming of massive SIMD, concurrent streams

Large on-chip memory bandwidth

#### GroqChip™ v1 MXM: Matrix Multiply Engines



Chip layout: MXM · SXM · MEM · VXM · MEM · SXM · MXM



| Numeric<br>Mode | Size                  | Supported<br>Density | Result<br>Tensor |
|-----------------|-----------------------|----------------------|------------------|
| int8            | [N, 320] x [320, 320] | Two per MXM          | int32            |
| float16         | [N, 320] x [160, 320] | One per MXM          | float32          |

- 320B x 320B dot product
- Loads 320B x16 in 20 cycles
- 20 cycle execution
- Fully pipelined, N
- Int8 & float16
- Full precision expansion
- 32-bit accumulate
- Used independently or together
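The int8 mode in the table, [N, 320] x [320, 320] with an int32 result, can be mimicked in plain Python to show why a 32-bit accumulator is needed. This is a functional sketch only; the function name is hypothetical and nothing here reflects how the MXM is actually implemented.

```python
VECTOR_LEN = 320  # matches the [N, 320] x [320, 320] int8 mode in the table


def int8_matmul(activations, weights):
    """Multiply [N, 320] int8 rows by a [320, 320] int8 matrix.

    Products are summed in a wide accumulator and checked against the int32
    range, mirroring the "32-bit accumulate" bullet.
    """
    result = []
    for row in activations:
        out_row = []
        for col in range(VECTOR_LEN):
            acc = 0
            for k in range(VECTOR_LEN):
                acc += row[k] * weights[k][col]
            assert -2**31 <= acc < 2**31, "accumulator left the int32 range"
            out_row.append(acc)
        result.append(out_row)
    return result


# Worst case for int8: every operand is -128, so each output element is
# 320 * (-128) * (-128) = 5,242,880, far outside int8 or int16 range.
worst_row = [-128] * VECTOR_LEN
worst_weights = [[-128] * VECTOR_LEN for _ in range(VECTOR_LEN)]
print(int8_matmul([worst_row], worst_weights)[0][0])  # 5242880
```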

#### GroqChip™ v1 VXM: Vector Execution Module



- Dataflow begins with memory Read onto Stream Tensor
- Many concurrent streams are supported in programming model
- VXM provides a flexible and programmable fabric for Compute
- Compute occurs on data locality of passing Stream Tensor
- MEM bandwidth supports high concurrency

#### GroqChip™ v1 SXM: Switch eXecution Module




Swiss army knife for data manipulation & Intra-vector byte operations

- **Distributor:** 4 per hemisphere perform unto mapping of input + mask to output stream within a 16 byte superlane
- **Transposer:** 2 per hemisphere perform intra-superlane transpose over 16 vectors for 20 superlanes
- **Permuter/Shifter:** arbitrary mapping of input + mask, shuffling between 320B vector elements; used for data transforms like pads/reshapes
- Shift, Rotate, Distribute, Permute, Transpose, Transport to SuperLanes

#### GroqChip™ v1 MEM: On-Chip SRAM



- 88 independent MEM slices with 8192 addresses each (220 MiB in total), arranged into quad timing groups
- A read from a single MEM slice creates a 320-byte stream; a write terminates a stream
- Group MEM slices for multi-dimensional tensors or multi-byte data types
- Can read and write one physical stream (vector) per cycle, from 2 banks
- Interfaces the full 64-stream bandwidth at 80 TBps
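The 220 MiB capacity quoted for the on-chip SRAM can be cross-checked from the slice and address counts. The one assumption in this sketch is that each address holds one 320-byte vector, which is inferred from the statement that a single read creates a 320-byte stream.

```python
SLICES = 88
ADDRESSES_PER_SLICE = 8192
BYTES_PER_ADDRESS = 320  # assumption: one 320-byte vector per address

total_bytes = SLICES * ADDRESSES_PER_SLICE * BYTES_PER_ADDRESS
total_mib = total_bytes / 2**20

print(f"{total_bytes} bytes = {total_mib:.0f} MiB")  # 230686720 bytes = 220 MiB
```

Under that assumption the numbers agree exactly, which suggests 220 MiB is the total across all 88 slices rather than a per-slice figure.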

### Scaling to 1000s of GroqChip™ Processors




### Software-Scheduled Network

#### Synchronous Chip-to-Chip communication

Chip-to-Chip (C2C) protocol enables synchronous communication across all TSPs in a network

- Clock drift across TSPs is accounted for deterministically
- Each TSP acts as both Processor + Router
  - Compiler schedules network packets as part of programs loaded onto each TSP in the system

#### No adaptive routing / congestion sensing needed

 Compiler knows exact cycle data should be sent from one TSP and received at another
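A toy model can make the idea of compile-time network scheduling more concrete. In the sketch below every link has a fixed, known latency and carries one transfer per cycle, so a scheduler can assign exact send and receive cycles ahead of time. The latency value, the data structures, and the function name are all hypothetical; this is not the Groq compiler's algorithm.

```python
from collections import namedtuple

Transfer = namedtuple("Transfer", "src dst send_cycle recv_cycle")

LINK_LATENCY_CYCLES = 50  # HYPOTHETICAL fixed link latency


def schedule_transfers(requests, link_latency=LINK_LATENCY_CYCLES):
    """Assign each (src, dst, earliest_cycle) request a conflict-free slot.

    Each directed link carries one transfer per cycle, so a request that
    finds its link busy is moved to the next free cycle at "compile time".
    """
    next_free = {}  # (src, dst) -> first cycle at which the link is free
    schedule = []
    for src, dst, earliest in requests:
        send = max(earliest, next_free.get((src, dst), 0))
        next_free[(src, dst)] = send + 1
        schedule.append(Transfer(src, dst, send, send + link_latency))
    return schedule


for transfer in schedule_transfers([(0, 1, 10), (0, 1, 10), (1, 2, 10)]):
    print(transfer)
```

Because every send and receive cycle is fixed before the program runs, the hardware needs no congestion sensing or adaptive routing in this model.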



Software-Scheduled Direct Network



### Deterministic Adaptive Routing

#### **Conventional Network**

- Commonly done based on network backpressure
- Reactive approach makes the routing decision difficult, increases latency, and increases hardware complexity
- Network latency is unpredictable

#### Software-scheduled Network

- Avoids congestion
- Lets the deterministic TSP architecture scale to deterministic multi-node network execution





#### Traditional Non-deterministic Network

#### Software-scheduled Network


### Low-diameter Network

Minimize the number of hops in the network

The total observed latency and its variance increase with the number of hops in the network

Dragonfly is a hierarchical topology that minimizes the number of hops taken

- Local group topology
- All-to-all global topology

Exploits packaging locality
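To see why a dragonfly keeps the hop count low, the sketch below counts minimal hops in a simplified dragonfly. The assumptions are not from the slides: routers inside a group are fully connected, every pair of groups shares exactly one global link, and the placement of that link within each group follows a made-up rule.

```python
def dragonfly_hops(src, dst, routers_per_group):
    """Minimal hop count between (group, router) pairs in a toy dragonfly.

    Assumed layout: the global link from group g to group h is hosted by
    router h % routers_per_group inside group g.
    """
    (src_group, src_router), (dst_group, dst_router) = src, dst
    if src_group == dst_group:
        return 0 if src_router == dst_router else 1

    exit_router = dst_group % routers_per_group   # gateway in source group
    entry_router = src_group % routers_per_group  # gateway in destination group

    hops = 1                                      # the global hop
    hops += int(src_router != exit_router)        # local hop to the gateway
    hops += int(dst_router != entry_router)       # local hop after the global link
    return hops


print(dragonfly_hops((0, 0), (0, 3), routers_per_group=4))
print(dragonfly_hops((0, 0), (1, 1), routers_per_group=4))
```

In this model no route needs more than three hops (local, global, local), regardless of how many groups there are.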







### AllReduce Comparison Results

Supercomputing Without Barriers

### Groq collective communication outperforms state-of-the-art collective AllReduce

Groq RealScale agnostic to common message sizes

Eliminates the need for message aggregation

When normalized, Groq TSP matches the bandwidth at large tensor size while significantly improving bandwidth at intermediate tensor size

- Comparison made with 8 GPU A100 system with NCCL
- A100 system has approximately 3x higher network channel bandwidth



### State-of-the-art LLM Inference Performance

Ten GroqRack<sup>™</sup> Compute Clusters



### Recap

#### Architecture Overview

 Determinism, flat memory hierarchy, 1D interconnect

#### **Key Functional Units**

MXM, VXM, SXM, MEM

#### Scaling to 1000s of GroqChips

 Plesiochronous, low-latency chip-to-chip communication




## Thank You!

abitar@groq.com

## Intro to MLAgility<sup>TM</sup> & GroqFlow<sup>TM</sup>

Sanjif Shanmugavelu Software Engineer


## AGENDA

1. High Level Software Stack Overview
2. GroqFlow Intro
3. MLAgility Intro



## GroqWare<sup>™</sup> Suite



### DIVERSE SUITE OF DEVELOPMENT TOOLS

| Out-of-Box              | <b>Groq Compiler</b> provides<br>out-of-box support for standard<br>Deep Learning models                                                                                               |
|-------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Fine Grained<br>Control | <b>Groq API</b> provides finer grained<br>control of GroqChip in order to<br>support custom applications                                                                               |
|                         | +                                                                                                                                                                                      |
|                         | <b>GroqView Profiler</b> provides<br>visualization of the chip's compute<br>and memory usage at compile time                                                                           |
| Productivity<br>Tools   | <b>GroqFlow Tool Chain</b> enables a<br>single line of Pytorch or TensorFlow<br>code to import and transform<br>models through a fully automated<br>tool chain to run on Groq hardware |

## MLAgility

Benchmark performance.

- The kernelless Groq<sup>™</sup> Compiler supports ML models out of the box.
- MLAgility is an open-source benchmarking tool, demonstrating model support and performance across a variety of platforms (Groq<sup>™</sup>, CPU, GPU etc.).
- You can add your own models and benchmarks.
- Groq<sup>™</sup> performance on the MLAgility benchmark is reproducible and guaranteed.
- Models are ported to the Groq<sup>™</sup> platform with GroqFlow<sup>™</sup>.



Figure 1: Public Groq HuggingFace space

## MLAgility Architecture

**MLAgility Setup** 

The diagram illustrates the MLAgility repository structure.

Simply put, the MLAgility models are leveraged by our benchmarking tool, benchit, to produce benchmarking outcomes showcased on our Hugging Face Spaces page.



\* INDICATES WORKS IN PROGRESS

## Recap

 We port models with GroqFlow and benchmark them with MLAgility



## Porting Models with GroqFlow<sup>TM</sup>

Sanjif Shanmugavelu Software Engineer


### AGENDA

1. How To GroqFlow
2. GroqFlow Best Practices
3. GroqFlow Examples
4. Debugging GroqFlow



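```
import transformers
import torch
from groqflow import groqit

# Instantiate a GPT-2 model with the default configuration
model = transformers.GPT2Model(transformers.GPT2Config())

# Dummy inputs that fix the input shape the built model will support
inputs = {
    "input_ids": torch.ones(1, 1_024, dtype=torch.long),
    "attention_mask": torch.ones(1, 1_024, dtype=torch.float),
}

# Build the model for Groq hardware
gmodel = groqit(model, inputs)

# Run inference with the built model
output = gmodel(**inputs)
```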

## Introducing GroqFlow™ Step 1: Get your model

```
import transformers
import torch
from groqflow import groqit

# Step 1: get your model
model = transformers.GPT2Model(transformers.GPT2Config())

inputs = {
    "input_ids": torch.ones(1, 1_024, dtype=torch.long),
    "attention_mask": torch.ones(1, 1_024, dtype=torch.float),
}

gmodel = groqit(model, inputs)

output = gmodel(**inputs)
```

## Introducing GroqFlow™ Step 2: Get some inputs

```
import transformers
import torch
from groqflow import groqit

model = transformers.GPT2Model(transformers.GPT2Config())

# Step 2: get some inputs
inputs = {
    "input_ids": torch.ones(1, 1_024, dtype=torch.long),
    "attention_mask": torch.ones(1, 1_024, dtype=torch.float),
}

# Building prints the progress messages shown below
gmodel = groqit(model, inputs)

output = gmodel(**inputs)
```

```
GroqFlow is building model "bert"
    Converting to ONNX
    Optimizing ONNX file
    Checking for Op support
    Converting to FP16
    Compiling model
    Assembling model
```

```
import transformers
import torch
from groqflow import groqit

model = transformers.GPT2Model(transformers.GPT2Config())

inputs = {
    "input_ids": torch.ones(1, 1_024, dtype=torch.long),
    "attention_mask": torch.ones(1, 1_024, dtype=torch.float),
}

gmodel = groqit(model, inputs)

# Calling the built model returns the output tensor shown below
output = gmodel(**inputs)
```

```
tensor([ 0.3628, 0.0489, 0.2952, 0.0022,
        -0.0161, 0.3451, -0.3209, 0.0021, ...
```

## Introducing GroqFlow™ Inference is easy!

## Introducing GroqFlow™ Clear messages

What if things don't go as planned?

Clear feedback on how to move forward

- GroqFlow is building model "bert"
- Converting to ONNX
- Optimizing ONNX file
- Checking for Op support
- Converting to FP16
- Compiling model
- Assembling model

## Introducing GroqFlow<sup>™</sup>



GroqIt Args

gmodel = groqit(model, inputs)

## Examples:

groqit(my\_pytorch\_model,inputs)

### Main GroqIt Args

model

- Model to be mapped to a GroqModel
- PyTorch model instance

GroqIt Args

## gmodel = groqit(model, inputs)

## **Bad Example:**

inputs = tokenizer("I like dogs")

### **Good Example:**

```
inputs = tokenizer("I like dogs", padding="max_length", max_length=128)
```

#### Main GroqIt Args

#### model

- Model to be mapped to a GroqModel
- Can be a PyTorch model instance or a path to an ONNX file

#### inputs

- Dictates the maximum input size the model will support
- Same exact format as your Pytorch inputs
- Hint: Pad your inputs to the right size
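The padding hint can be sketched without any tokenizer at all. The helper below is a hypothetical stand-in for what `padding="max_length"` achieves: every input ends up with the same fixed length, and an attention mask marks which positions are real tokens. The token ids in the usage line are made up.

```python
def pad_to_max_length(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    if len(token_ids) > max_length:
        raise ValueError("input is longer than the size the model was built for")
    padding = max_length - len(token_ids)
    input_ids = list(token_ids) + [pad_id] * padding
    attention_mask = [1] * len(token_ids) + [0] * padding
    return input_ids, attention_mask


# Made-up token ids standing in for a short tokenized sentence.
ids, mask = pad_to_max_length([101, 1045, 2066, 6077, 102], max_length=8)
print(ids)
print(mask)
```

Because the built model supports one maximum input size, padding every input to that size keeps the shapes consistent between build time and inference time.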

GroqIt Args

## gmodel = groqit(model, inputs, num\_chips)

## Example:

groqit(model, inputs, num\_chips=4)

### Main GroqIt Args


### num\_chips

- Number of GroqChip processors to be used
- Automatically selects by default
- 1, 2 or 4 chips are valid for A1.1 (1, 2, 4, 8 for A1.4)

GroqIt Args

## gmodel = groqit(model, inputs, rebuild)

## Rebuild a model every time:

groqit(model, inputs, rebuild="always")

### Use cached model if available:

groqit(model, inputs, rebuild="never")

#### Main GroqIt Args


### rebuild

- GroqIt loads successfully built models by default
- Set rebuild to "always" to force GroqIt to rebuild it

GroqIt Args

## gmodel = groqit(model, inputs, build\_name)

## Example:

groqit(modelA, inputsA, build\_name="A")  # Builds modelA

groqit(modelB, inputsB, build\_name="B")  # Builds modelB

#### Main GroqIt Args


### build\_name

- Name used to cache the model
- Defaults to the name of the script
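The interaction between `rebuild` and `build_name` can be pictured with a toy cache. This is not GroqFlow code: the dictionary stands in for the on-disk cache, `cached_build` and `build_fn` are hypothetical names, and only the two `rebuild` values shown on the slides are modeled.

```python
_CACHE = {}  # build_name -> built artifact (stand-in for an on-disk cache)


def cached_build(build_name, build_fn, rebuild="never", cache=None):
    """Toy model of the caching behaviour described on the slides."""
    cache = _CACHE if cache is None else cache
    if rebuild not in ("always", "never"):
        raise ValueError("this sketch only models 'always' and 'never'")
    if rebuild == "always" or build_name not in cache:
        cache[build_name] = build_fn()  # the (expensive) build happens here
    return cache[build_name]


build_count = 0


def expensive_build():
    global build_count
    build_count += 1
    return f"artifact-{build_count}"


print(cached_build("A", expensive_build))                    # builds
print(cached_build("A", expensive_build))                    # served from the cache
print(cached_build("A", expensive_build, rebuild="always"))  # builds again
print(cached_build("B", expensive_build))                    # separate cache entry
```

Giving two models different build names keeps their cached builds from overwriting each other, which is what the `build_name` example above relies on.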

Groq Model Functions

## gmodel = groqit(model, inputs) gmodel(\*\*inputs)

## Example:

>>> pytorch\_model(\*\*inputs) tensor([0.245, 0.235, 0.235, 0.267])

>>> gmodel(\*\*inputs) tensor([0.245, 0.235, 0.235, 0.267])

### Main Groq Model Functions

### inference/forward pass

- The Groq Model is callable like a Pytorch model
- Performing inference doesn't require rebuilding
- Hint: Pad your inputs to the same shape used when creating the model

**Note:** Not useful for timing purposes, since the entire Groq environment is set up each time

Groq Model Functions

## gmodel = groqit(model, inputs) gmodel.benchmark()

(coming soon)

## Example:

>>> latency = gmodel.benchmark()

>>> print(f"Latency is {latency}ms")

Latency is 0.109ms

### Main Groq Model Functions


### benchmark (coming soon)

- Returns the average latency of 100 runs in ms
- Latency includes PCIe times + on-chip compute
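The "average latency of 100 runs" measurement can be sketched for any Python callable. This is a generic timing helper with a hypothetical name, not the implementation behind `gmodel.benchmark()`; it simply shows how a mean latency in milliseconds is derived from repeated runs.

```python
import time


def mean_latency_ms(fn, runs=100):
    """Call fn() `runs` times and return the mean wall-clock latency in ms."""
    total_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        total_seconds += time.perf_counter() - start
    return total_seconds / runs * 1000.0


# Stand-in workload; on real hardware fn would wrap a single inference.
latency = mean_latency_ms(lambda: sum(range(10_000)))
print(f"Latency is {latency:.3f}ms")
```

Timing only the call itself, as done here, keeps one-time setup costs out of the reported latency.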

Groq Model Functions

## gmodel = groqit(model, inputs) gmodel.netron()

### Example:



#### Main Groq Model Functions


#### netron

- Opens the ONNX model generated by GroqIt

Groq Model Functions

## gmodel = groqit(model, inputs,groq\_view=True) gmodel.groqview()

## Example:




#### **Main Groq Model Functions**


### groqview

- Visualize data streams and execution schedule
- Requires compiling with groq\_view flag

## Low Latency Every Time

**BERT-base Latency** 



## GroqCard delivers up to 8.3X better performance on the slowest inference




Nvidia results from publicly available data on github.com/NVIDIA (batch size 1 on TensorRT v8.0.1.6)

\*Lower is better

\*\*Increase is limited to host and PCIe IO variance

Groq accelerated BERT inference to achieve a 99th percentile latency of **117 µs** 



## Recap

 GroqFlow is a wrapper around the GroqWare<sup>™</sup> Suite that gives you the power to quickly compile and run models.



## Benchmarking Models with MLAgility<sup>TM</sup>

Sanjif Shanmugavelu Software Engineer


## AGENDA

1. MLAgility Devices and Runtimes
2. MLAgility benchit CLI
3. Writing Scripts with MLAgility
4. MLAgility Report Generation and Visualization
5. MLAgility Future Work



## MLAgility Devices and Runtimes Benchmark setup

MLAgility's tools currently support the following combinations of runtimes and devices. We leverage ONNX files because of their broad compatibility with model frameworks (PyTorch, Keras, etc.), software (ONNX Runtime, TensorRT, Groq Compiler, etc.), and devices (CPUs, GPUs, GroqChip processors, etc.)

| Device Type | Device arg | Runtime                                                                                       | Runtime arg                         | Specific Devices                                    |
|-------------|------------|-----------------------------------------------------------------------------------------------|-------------------------------------|-----------------------------------------------------|
| Nvidia GPU  | nvidia     | TensorRT <sup>†</sup>                                                                         | trt                                 | Any Nvidia GPU<br>supported by TensorRT             |
| x86 CPU     | x86        | ONNX Runtime <sup>‡</sup><br>Pytorch Eager <sup>§</sup><br>Pytorch 2.x Compiled <sup>*§</sup> | ort, torch-eager,<br>torch-compiled | Any Intel or AMD CPU<br>supported by the<br>runtime |
| Groq        | groq       | GroqFlow                                                                                      | groq                                | GroqChip1                                           |





The MLAgility Benchmarking and Tools package provides a CLI, benchit, and Python API for benchmarking ML models

Let's benchmark the popular BERT transformer model with benchit:

    benchit models/transformers/bert.py --device {groq, nvidia, x86}

The device flag specifies the benchmark hardware. The output is saved in the user's .cache/mlagility directory.

### --device x86

Models discovered during profiling:

| bert.py:            |                                                                             |
|---------------------|-----------------------------------------------------------------------------|
| model (executed 1x) |                                                                             |
| Model Type:         | Pytorch (torch.nn.Module)                                                   |
| Class:              | BertModel (`<class 'transformers.models.bert.modeling_bert.BertModel'>`)    |
| Location:           | /home/jfowers/mlagility/models/transformers/bert.py, line 18                |
| Parameters:         | 109,482,240 (208.8 MB)                                                      |
| Hash:               | d59172a2                                                                    |
| Status:             | Successfully benchmarked on Intel(R) Xeon(R) CPU @ 2.20GHz (ort v1.14.1)    |
|                     | Mean Latency: 345.341 milliseconds (ms)                                     |
|                     | Throughput: 2.9 inferences per second (IPS)                                 |

### --device nvidia

Models discovered during profiling:

| hello_world.py:             |                                                               |
|-----------------------------|---------------------------------------------------------------|
| pytorch_model (executed 1x) |                                                               |
| Model Type:                 | Pytorch (torch.nn.Module)                                     |
| Class:                      | SmallModel (`<class 'hello_world.SmallModel'>`)               |
| Location:                   | /home/jfowers/mlagility/examples/cli/hello_world.py, line 29  |
| Parameters:                 | 55 (<0.1 MB)                                                  |
| Hash:                       | 479b1332                                                      |
| Status:                     | Model successfully benchmarked on NVIDIA A100-SXM4-40GB       |
|                             | Mean Latency: 0.027 milliseconds (ms)                         |
|                             | Throughput: 21920.5 inferences per second (IPS)               |

pytorch\_outputs: tensor([-0.1675, 0.1548, -0.1627, 0.0067, 0.3353], grad\_fn=<AddBackward0>)

Woohoo! The 'benchmark' command is complete.

## MLAgility Input

How to write a benchmark script

The following example, copied from models/transformers/bert.py, is a sample input script for the MLAgility benchmark:

    # labels: test_group::mlagility name::bert author::huggingface_pytorch
    """
    https://huggingface.co/docs/transformers/v4.26.1/en/model_doc/bert#overview
    """
    from mlagility.parser import parse
    import transformers
    import torch

    torch.manual_seed(0)

    # Parsing command-line arguments
    batch_size, max_seq_length = parse(["batch_size", "max_seq_length"])

    # Model and input configurations
    config = transformers.BertConfig()
    model = transformers.BertModel(config)
    inputs = {
        "input_ids": torch.ones(batch_size, max_seq_length, dtype=torch.long),
        "attention_mask": torch.ones(batch_size, max_seq_length, dtype=torch.float),
    }

    # Call model
    model(**inputs)

It has the following properties:

- Labels in the top line of the file
- Docstring indicating where the model was sourced from
- mlagility.parser.parse() is used to parameterize the model
- The model is instantiated and invoked against a set of inputs

## MLAgility Full Benchmark

Automated push-button benchmarking

Once you have fulfilled the prerequisites, you can evaluate one model from the benchmark with a command like this:

    cd MLAGILITY_ROOT/models  # MLAGILITY_ROOT is where you cloned mlagility
    benchit selftest/linear.py

You can also run the entire MLAgility benchmark in one shot with:

    cd MLAGILITY_ROOT/models  # MLAGILITY_ROOT is where you cloned mlagility
    benchit */*.py

Note: Benchmarking the entire corpus of MLAgility models might take a very long time.

## MLAgility Report Generation

Collect and present results

You can aggregate all of the benchmarking results from your mlagility cache into a CSV file with:

    benchit report

If you want to report on only a subset of models, we recommend saving the benchmarking results into a specific cache directory (by default, all results are saved in /home/{\$USER}/.cache/mlagility):

    # Save benchmark results into a specific cache directory
    benchit models/selftest/*.py -d selftest_results

    # Report the results from the `selftest_results` cache
    benchit report -d selftest_results

## MLAgility Limitations and Future Work

### **Current Limitations / Constraints:**

- Groq's latency is computed using GroqModel.estimate\_latency()
  - Takes into account deterministic compute time and estimates an ideal runtime with ideal I/O time
  - It does not take into account runtime performance
- Results currently only represent batch 1 performance
- Limited number of models, devices, vendors, and runtimes

## MLAgility Limitations and Future Work

To infinity and beyond

### Future work:

- Include additional classes of models
- Experiments that include sweeps over batch and input sizes
- Include operator microbenchmarks
- Increase the number of devices from
- Include devices from additional vendors and number of runtimes supported

## Recap

 MLAgility is a fully open-source benchmarking tool to benchmark acceleration hardware and runtimes.




## Thank You!

sshanmugavelu@groq.com