Analysis

class llm_analysis.analysis.ActivationRecomputation(value)[source]

Bases: Enum

An enumeration.

ATTN = 2

Selectively checkpoints the input to the attention module in a transformer layer; requires an extra forward pass on attention.

ATTN_COMPUTE = 1

Selectively checkpoints the attention computation (QK^T matrix multiply, softmax, softmax dropout, and attention over V) in the attention module of a transformer layer; this part takes up a considerable amount of memory but is not computationally expensive to recompute.

FULL = 4

Full activation recomputation stores the input to the transformer layer; requires the least amount of memory; requires an extra forward pass of the layer.

NONE = 0

No activation recomputation; requires the most amount of memory.

NORM_ATTN_NORM = 3

Selectively checkpoints the input to the sequence of modules (layernorm-attention-layernorm) in a transformer layer; requires an extra forward pass on (layernorm-attention-layernorm).

class llm_analysis.analysis.DSZeRO(value)[source]

Bases: Enum

An enumeration.

NONE = 0

No DeepSpeed ZeRO; requires the most memory.

STAGE_1 = 1

ZeRO stage 1 shards the optimizer states across the data parallel group.

STAGE_2 = 2

ZeRO stage 2 shards the optimizer states and gradients across the data parallel group.

STAGE_3 = 3

ZeRO stage 3 shards the optimizer states, gradients, and model weights across the data parallel group.
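The stages differ only in which training states are partitioned across the data parallel group. A minimal sketch of that mapping follows; `ZeROStage` and `sharded_states` are hypothetical stand-ins written for illustration and are not imported from llm_analysis.

```python
from enum import Enum


class ZeROStage(Enum):
    """Hypothetical stand-in mirroring the documented DSZeRO values."""
    NONE = 0
    STAGE_1 = 1
    STAGE_2 = 2
    STAGE_3 = 3


def sharded_states(stage: ZeROStage) -> set:
    """Return the training states sharded across the data parallel group."""
    states = set()
    if stage.value >= 1:
        states.add("optimizer_states")
    if stage.value >= 2:
        states.add("gradients")
    if stage.value >= 3:
        states.add("model_weights")
    return states
```

Each stage is a superset of the previous one, so a stage comparison is enough to decide whether a given state is divided by the data parallel size.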

class llm_analysis.analysis.DtypeConfig(name='w16a16e16', weight_bits=16, activation_bits=16, embedding_bits=16)[source]

Bases: object

activation_bits: int = 16
embedding_bits: int = 16
name: str = 'w16a16e16'
weight_bits: int = 16
class llm_analysis.analysis.Enum(value)[source]

Bases: object

Generic enumeration.

Derive from this class to define new enumerations.

name

The name of the Enum member.

value

The value of the Enum member.

class llm_analysis.analysis.GPUConfig(name, mem_per_GPU_in_GB, hbm_bandwidth_in_GB_per_sec, intra_node_bandwidth_in_GB_per_sec, intra_node_min_message_latency, peak_fp16_TFLOPS, peak_i8_TFLOPS=None, peak_i4_TFLOPS=None, inter_node_bandwidth_in_GB_per_sec=200)[source]

Bases: object

hbm_bandwidth_in_GB_per_sec: float
inter_node_bandwidth_in_GB_per_sec: float = 200
intra_node_bandwidth_in_GB_per_sec: float
intra_node_min_message_latency: float
mem_per_GPU_in_GB: float
name: str
peak_fp16_TFLOPS: float
peak_i4_TFLOPS: float = None
peak_i8_TFLOPS: float = None
class llm_analysis.analysis.LLMAnalysis(model_config, gpu_config, dtype_config=DtypeConfig(name='w16a16e16', weight_bits=16, activation_bits=16, embedding_bits=16), parallelism_config=ParallelismConfig(tp_size=1, pp_size=1, dp_size=1, ep_size=1, sp_size=1), achieved_tflops=None, achieved_memory_bandwidth_GBs=None, flops_efficiency=None, hbm_memory_efficiency=None, intra_node_memory_efficiency=1.0, inter_node_memory_efficiency=1.0)[source]

Bases: object

Given the specified model, GPU, data type, parallelism configuration/implementation, LLMAnalysis estimates the latency and memory usage of LLMs for training or inference.

Refer to the train and infer entry functions for usage details.

config_batch_size_and_gradient_accumulation_steps(max_batch_size_per_gpu, batch_size_per_gpu=None, gradient_accumulation_steps=None, global_batch_size=None)[source]

Configure batch_size_per_gpu, gradient_accumulation_steps and global_batch_size (effective batch size). If none is given, find a maximum batch_size_per_gpu while satisfying the constraint global_batch_size == batch_size_per_gpu * gradient_accumulation_steps * dp_size.

Parameters:
  • max_batch_size_per_gpu (int) – the max batch size per gpu before OOM

  • batch_size_per_gpu (int, optional) – batch size per GPU. Defaults to None.

  • gradient_accumulation_steps (int, optional) – gradient accumulation steps. Defaults to None.

  • global_batch_size (int, optional) – global batch size (effective batch size). Defaults to None.

Returns:

(batch_size_per_gpu, gradient_accumulation_steps, global_batch_size)

Return type:

tuple
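The constraint above can be sketched in isolation. The helper below is a hypothetical illustration, not the library's implementation: it covers only the cases where nothing or only global_batch_size is given, and it assumes that gradient_accumulation_steps defaults to 1 when nothing is specified.

```python
def solve_batch_config(max_batch_size_per_gpu, dp_size, global_batch_size=None):
    """Pick the largest batch_size_per_gpu satisfying
    global_batch_size == batch_size_per_gpu * gradient_accumulation_steps * dp_size.
    """
    if global_batch_size is None:
        # Nothing given: use the largest micro batch and no accumulation.
        bs = max_batch_size_per_gpu
        return bs, 1, bs * dp_size
    if global_batch_size % dp_size != 0:
        raise ValueError("global_batch_size must be divisible by dp_size")
    per_replica = global_batch_size // dp_size
    # Search downward for the largest micro batch that divides the per-replica batch.
    for bs in range(min(max_batch_size_per_gpu, per_replica), 0, -1):
        if per_replica % bs == 0:
            return bs, per_replica // bs, global_batch_size
    raise ValueError("no valid batch_size_per_gpu found")
```

For example, with max_batch_size_per_gpu=6, dp_size=4 and global_batch_size=64, each replica must process 16 samples per step, and the largest divisor of 16 that fits in memory is 4, giving 4 accumulation steps.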

get_TFLOPS_per_gpu()[source]

Get the expected TFLOPS per GPU for the specified data type configuration/GPU (adjusted by flops_efficiency)

Returns:

TFLOPS per GPU

Return type:

float
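As a rough illustration of the adjustment, the expected throughput can be read as the GPU's peak for the chosen data type scaled by flops_efficiency. The helper name below is hypothetical, and how the library selects the peak value for a given DtypeConfig is not shown here.

```python
def expected_tflops_per_gpu(peak_tflops, flops_efficiency):
    """Scale a peak TFLOPS figure by an efficiency in (0, 1]."""
    if not 0.0 < flops_efficiency <= 1.0:
        raise ValueError("flops_efficiency must be in (0, 1]")
    return peak_tflops * flops_efficiency


# An A100 has a dense FP16 tensor-core peak of 312 TFLOPS.
a100_expected = expected_tflops_per_gpu(312.0, 0.5)
```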

get_activation_memory_output_embedding(batch_size, seq_len)[source]

Get the memory (in bytes) required to store the activations of output embedding (logits)

Return type:

float

get_activation_memory_per_layer(batch_size, seq_len, is_inference=True, activation_recomputation=ActivationRecomputation.NONE, layernorm_dtype_bytes=4, flash_attn=True, softmax_dropout=False, mlp_activation_quant_bits=None, mlp_1linear_quant_bits=None, mlp_gelu_input_quant_bits=None, mlp_2linear_quant_bits=None, mlp_recompute_gelu=False, return_breakdown=False)[source]

Get the memory (in bytes) required to store the activations of a transformer layer, given the batch size, sequence length, and whether it is inference or training, the activation recomputation strategy, and the activation data type. Refer to https://arxiv.org/abs/2205.05198 for details. For inference, this assumes the maximum tensor buffer reuse.

Parameters:
  • batch_size (int) –

  • seq_len (int) – sequence length

  • is_inference (bool, optional) – whether it is inference or not. Return the max memory activation tensor size between layernorm/attn/mlp. Defaults to True.

  • activation_recomputation (ActivationRecomputation, optional) – activation recomputation strategy. Defaults to ActivationRecomputation.NONE.

  • layernorm_dtype_bytes (int, optional) – number of bytes in the data type for the layernorm activations. Defaults to BYTES_FP32. Often has to be FP32 in training to maintain model accuracy.

  • flash_attn (bool, optional) – whether to use Flash Attention. Defaults to True.

  • softmax_dropout (bool, optional) – whether to apply dropout after softmax. Defaults to False.

  • mlp_activation_quant_bits (int, optional) – number of bits to quantize MLP activations; if set, override the values for mlp_1linear_quant_bits, mlp_gelu_input_quant_bits and mlp_2linear_quant_bits. Defaults to None.

  • mlp_1linear_quant_bits (int, optional) – number of bits to quantize the input activations of the first linear layer. Defaults to None.

  • mlp_gelu_input_quant_bits (int, optional) – number of bits to quantize the GELU input activations. Defaults to None.

  • mlp_2linear_quant_bits (int, optional) – number of bits to quantize the input activations of the second linear layer. Defaults to None.

  • mlp_recompute_gelu (bool, optional) – whether to recompute the gelu activation in the MLP backward pass. Defaults to False.

Returns:

the memory (in bytes) required to store the activations of a transformer layer or a tuple of its breakdown

Return type:

Union[float, tuple]
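For orientation, the baseline per-layer formula from the referenced paper (arXiv:2205.05198) can be written out directly. This sketch assumes 16-bit activations, 1-byte dropout masks, tensor parallelism without sequence parallelism, and no Flash Attention; the function name is hypothetical, and the library's own accounting (for example with flash_attn=True or FP32 layernorm) will differ.

```python
def activation_bytes_per_layer(seq_len, batch_size, hidden_dim, n_head,
                               tp_size=1, attn_compute_recomputed=False):
    """Training activation bytes for one transformer layer:
    s*b*h*(10 + 24/t + 5*a*s/(h*t)), which is s*b*h*(34 + 5*a*s/h) for t=1.
    """
    s, b, h, a, t = seq_len, batch_size, hidden_dim, n_head, tp_size
    per_token_factor = 10 + 24 / t
    if not attn_compute_recomputed:
        # Quadratic-in-sequence term: QK^T scores, softmax output, dropout mask.
        per_token_factor += 5 * a * s / (h * t)
    return s * b * h * per_token_factor
```

With seq_len=2048, hidden_dim=4096 and n_head=32, the quadratic term contributes 80 of the 114 bytes per element, which is why selectively recomputing the attention computation saves so much memory at long sequence lengths.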

get_activation_memory_per_layer_attn(batch_size, seq_len, is_inference=True, flash_attn=True, softmax_dropout=False, attn_dropout=True, activation_recomputation=ActivationRecomputation.NONE)[source]

Get the memory (in bytes) required to store the activations of the attention in a transformer layer, given the batch size, sequence length, whether it is inference or training, the activation recomputation strategy, and the activation data type. The attention activations include the input to the Q/K/V GEMM, the QK^T matrix multiply, softmax, softmax dropout, attention over V, and the input to the attention output GEMM; if training, they also include the softmax dropout mask and the attention dropout mask. Refer to https://arxiv.org/abs/2205.05198 for details.

Parameters:
  • batch_size (int) – micro batch size

  • seq_len (int) – sequence length

  • is_inference (bool, optional) – whether it is inference or not. Defaults to True.

  • flash_attn (bool, optional) – whether to use Flash Attention. Defaults to True.

  • softmax_dropout (bool, optional) – whether to apply dropout after softmax. Defaults to False.

  • activation_recomputation (ActivationRecomputation, optional) – activation recomputation strategy. Defaults to ActivationRecomputation.NONE.

Returns:

the memory (in bytes) required to store the activations of the attention in a transformer layer

Return type:

float

get_activation_memory_per_layer_mlp(batch_size, seq_len, is_inference=True, activation_recomputation=ActivationRecomputation.NONE, mlp_activation_quant_bits=None, mlp_1linear_quant_bits=None, mlp_gelu_input_quant_bits=None, mlp_2linear_quant_bits=None, recompute_gelu=False, gated_linear_units=False, with_dropout=False)[source]

Get the memory (in bytes) required to store the activations of the MLP in a transformer layer, given the batch size, sequence length, and whether it is inference or training, the activation recomputation strategy, and the activation data type. The mlp activations include the input to the two linear layers. Refer to https://arxiv.org/abs/2205.05198 for details.

Parameters:
  • batch_size (int) – micro batch size

  • seq_len (int) – sequence length

  • is_inference (bool, optional) – whether it is inference or not. Defaults to True.

  • activation_recomputation (ActivationRecomputation, optional) – activation recomputation strategy. Defaults to ActivationRecomputation.NONE.

  • mlp_activation_quant_bits (int, optional) – number of bits to quantize MLP activations; if set, override the values for mlp_1linear_quant_bits, mlp_gelu_input_quant_bits and mlp_2linear_quant_bits. Defaults to None.

  • mlp_1linear_quant_bits (int, optional) – number of bits to quantize the input activations of the first linear layer. Defaults to None.

  • mlp_gelu_input_quant_bits (int, optional) – number of bits to quantize the GELU input activations. Defaults to None.

  • mlp_2linear_quant_bits (int, optional) – number of bits to quantize the input activations of the second linear layer. Defaults to None.

  • recompute_gelu (bool, optional) – whether to recompute gelu in backward pass.

  • gated_linear_units (bool, optional) – whether to use gated linear units.

Returns:

the memory (in bytes) required to store the activations of the MLP in a transformer layer

Return type:

float

get_activation_memory_per_layernorm(batch_size, seq_len, dtype_bytes=4)[source]

Get the memory (in bytes) required to store the activations of a single layernorm in a transformer layer, given the batch size, sequence length. Refer to https://arxiv.org/abs/2205.05198 for details.

Parameters:
  • batch_size (int) – micro batch size

  • seq_len (int) – sequence length

  • dtype_bytes (int, optional) – number of bytes in the data type for the layernorm activation. Defaults to BYTES_FP32. Needs to be at least FP16 to maintain accuracy.

Returns:

the memory (in bytes) required to store the activations of a single layernorm in a transformer layer

Return type:

float

get_configs_desc()[source]
Return type:

str

get_gpu_hbm_bandwidth()[source]
Return type:

float

get_inter_node_bandwidth()[source]
Return type:

float

get_intra_node_bandwidth()[source]
Return type:

float

get_latency_fwd(batch_size, seq_len, is_inference=True, activation_recomputation=ActivationRecomputation.NONE, layernorm_dtype_bytes=4, breakdown_prefix='', ds_zero=DSZeRO.NONE)[source]

Get the latency for the forward pass of the transformer, given the batch size, sequence length, and whether it is inference or not, the activation recomputation strategy, and the number of bytes in the data type for the layernorm activations.

Parameters:
  • batch_size (int) – batch size

  • seq_len (int) – sequence length

  • is_inference (bool, optional) – whether it is inference or not. Defaults to True.

  • activation_recomputation (ActivationRecomputation, optional) – activation recomputation strategy. Defaults to ActivationRecomputation.NONE.

  • layernorm_dtype_bytes (int, optional) – number of bytes in the data type for the layernorm activations. Defaults to BYTES_FP32. Often has to be FP32 in training to maintain model accuracy.

  • breakdown_prefix (str, optional) – prefix for the breakdown dict keys. Defaults to “”.

  • ds_zero (DSZeRO, optional) – which DeepSpeed ZeRO stage to use. Defaults to DSZeRO.NONE (disabled).

Returns:

a tuple of the latency in seconds for the forward pass of the transformer and its breakdown dict

Return type:

tuple

get_latency_fwd_input_embedding(batch_size, seq_len, dtype_bytes=4)[source]

Get the latency for the forward pass of the input embedding layer, given the batch size, sequence length, and data type of the embedding weight.

Parameters:
  • batch_size (int) – batch size

  • seq_len (int) – sequence length

  • dtype_bytes (int, optional) – number of bytes in the data type for the embedding weight. Defaults to BYTES_FP32.

Returns:

the latency in seconds for the forward pass of the input embedding layer

Return type:

float

get_latency_fwd_output_embedding_loss(batch_size, seq_len)[source]

Get the latency for the forward pass of the output embedding layer (computing the logits). The operation is compute bound. With tensor parallelism size > 1, an allgather communicates batch_size * seq_len elements, which is ignored here. Refer to https://arxiv.org/abs/1909.08053 for more details.

Parameters:
  • batch_size (int) – batch size

  • seq_len (int) – sequence length

Returns:

the latency in seconds for the forward pass of the output embedding layer

Return type:

float

get_latency_fwd_per_layer(batch_size, seq_len, is_inference=True, activation_recomputation=ActivationRecomputation.NONE, layernorm_dtype_bytes=4, ds_zero=DSZeRO.NONE)[source]

Get the latency for the forward pass of a transformer layer, given the batch size, sequence length, training or inference, activation recomputation strategy, and layernorm data type. The latency is the sum of the latency for the attention module, MLP module, two layernorms, and two (Megatron-LM tp implementation) allreduce communications across the tensor parallel group.

Parameters:
  • batch_size (int) – batch size

  • seq_len (int) – sequence length

  • is_inference (bool, optional) – whether it is inference or not. Defaults to True.

  • activation_recomputation (ActivationRecomputation, optional) – activation recomputation strategy. Defaults to ActivationRecomputation.NONE.

  • layernorm_dtype_bytes (int, optional) – number of bytes in the data type for the layernorm activations. Defaults to BYTES_FP32. Often has to be FP32 in training to maintain model accuracy.

  • ds_zero (DSZeRO, optional) – which DeepSpeed ZeRO stage to use. Defaults to DSZeRO.NONE (disabled).

Returns:

a tuple of the latency in seconds for the forward pass of a transformer layer and its breakdown dict

Return type:

tuple

get_latency_fwd_per_layer_attn(batch_size, seq_len, is_inference=True, activation_recomputation=ActivationRecomputation.NONE)[source]

Get the latency for the forward pass of the attention module in a transformer layer, given the batch size and sequence length. The latency is the max of the compute latency and the memory latency, assuming the compute and memory operations are perfectly overlapped.

Parameters:
  • batch_size (int) – batch size

  • seq_len (int) – sequence length

  • is_inference (bool, optional) – whether it is inference or not. Defaults to True.

  • activation_recomputation (ActivationRecomputation, optional) – activation recomputation strategy. Defaults to ActivationRecomputation.NONE.

Returns:

the latency in seconds for the forward pass of the attention module in a transformer layer

Return type:

float
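The "max of compute and memory latency" rule is a roofline-style estimate. A minimal sketch, with a hypothetical helper name and with throughput and bandwidth given in plain FLOP/s and bytes/s:

```python
def roofline_latency(num_flops, num_bytes, flops_per_sec, bytes_per_sec):
    """Latency of an operation whose compute and memory traffic overlap perfectly."""
    compute_latency = num_flops / flops_per_sec
    memory_latency = num_bytes / bytes_per_sec
    # Whichever resource is the bottleneck determines the latency.
    return max(compute_latency, memory_latency)
```

An operation is compute-bound when the first term dominates and memory-bound when the second does; small batches during token generation usually fall into the memory-bound regime because the weights must be read regardless of batch size.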

get_latency_fwd_per_layer_mlp(batch_size, seq_len, is_inference=True, activation_recomputation=ActivationRecomputation.NONE)[source]

Get the latency for the forward pass of the MLP module in a transformer layer, given the batch size and sequence length. The latency is the max of the compute latency and the memory latency, assuming the compute and memory operations are perfectly overlapped.

Parameters:
  • batch_size (int) – batch size

  • seq_len (int) – sequence length

  • is_inference (bool, optional) – whether it is inference or not. Defaults to True.

  • activation_recomputation (ActivationRecomputation, optional) – activation recomputation strategy. Defaults to ActivationRecomputation.NONE.

Returns:

the latency in seconds for the forward pass of the MLP module in a transformer layer

Return type:

float

get_latency_fwd_per_layer_mlp_moe_alltoall(batch_size, seq_len)[source]
Return type:

float

get_latency_fwd_per_layer_shared_dp_comm()[source]
Return type:

float

get_latency_fwd_per_layernorm(batch_size, seq_len, dtype_bytes=4)[source]

Get the latency for the forward pass of a single layernorm in a transformer layer, given the batch size, sequence length, and data type. The latency is the memory latency as layernorm is a memory-bound operation.

Parameters:
  • batch_size (int) – batch size

  • seq_len (int) – sequence length

  • dtype_bytes (int, optional) – number of bytes in the data type for the layernorm activation. Defaults to BYTES_FP32. Needs to be at least FP16 to maintain accuracy.

Returns:

the latency in seconds for the forward pass of a single layernorm in a transformer layer

Return type:

float

get_latency_fwd_per_tp_comm(batch_size, seq_len, dtype_bytes)[source]

Get the latency of a single allreduce communication across the tensor parallel group in the forward pass of a transformer layer, given the batch size, sequence length, and data type, and assuming a ring allreduce implementation. The latency is the max of the latency for the allreduce and the minimum message latency through intra-node connect (Note that tensor parallelism size <= number of GPUs per node).

Parameters:
  • batch_size (int) – batch size

  • seq_len (int) – sequence length

  • dtype_bytes (int) – number of bytes in the data type

Returns:

the latency in seconds for a single allreduce communication across the tensor parallel group in the forward pass of a transformer layer

Return type:

float
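A ring allreduce over n GPUs moves 2(n-1)/n times the message size through each GPU. The sketch below applies that factor and the minimum-latency floor described above; the helper name is hypothetical, and the message size of batch_size * seq_len * hidden_dim elements is an assumption about what is being reduced.

```python
def ring_allreduce_latency(batch_size, seq_len, hidden_dim, dtype_bytes,
                           tp_size, bandwidth_bytes_per_sec, min_message_latency):
    """Estimated latency (seconds) of one ring allreduce across tp_size GPUs."""
    if tp_size == 1:
        return 0.0  # nothing to communicate
    message_bytes = batch_size * seq_len * hidden_dim * dtype_bytes
    # Reduce-scatter plus allgather each move (n-1)/n of the message per GPU.
    bytes_per_gpu = 2 * (tp_size - 1) / tp_size * message_bytes
    return max(bytes_per_gpu / bandwidth_bytes_per_sec, min_message_latency)
```

For tiny messages the bandwidth term vanishes and the result is pinned to min_message_latency, which is why the floor matters for small batch sizes.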

get_latency_weight_update()[source]
get_memory_embedding(ds_zero=DSZeRO.NONE)[source]

Get the memory (in bytes) required to store the embedding layer, given the number of parameters in the embedding layer, the data type (defaults to FP32) used for the weights, and the tensor parallelism size (Megatron-LM partitions the embedding layer across the tensor parallel groups).

Parameters:

ds_zero (DSZeRO, optional) – which DeepSpeed ZeRO stage to use. Defaults to DSZeRO.NONE (disabled, no sharding).

Returns:

the memory (in bytes) required to store the embedding layer

Return type:

float

get_memory_kv_cache_per_layer(batch_size, seq_len, kv_cache_dtype_bytes=None)[source]

Get the memory (in bytes) required to store the key and value cache for a transformer layer in inference, given the batch size, sequence length, activation data type, and tensor parallelism size.

Parameters:
  • batch_size (int) – batch size

  • seq_len (int) – sequence length

  • kv_cache_dtype_bytes (int, optional) – number of bytes in the data type for the kv_cache. Defaults to None. Often has to be at least FP16 in inference to maintain model accuracy.

Returns:

the memory (in bytes) required to store the key and value cache for a transformer layer in inference

Return type:

float
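A common way to estimate this quantity is two tensors (keys and values) of shape batch x sequence x KV heads x head dimension, split across the tensor parallel group. The sketch below follows that estimate; the function name and its head_dim parameter are hypothetical, and the library's exact handling of grouped-query attention is not shown here.

```python
def kv_cache_bytes_per_layer(batch_size, seq_len, n_head, head_dim,
                             num_key_value_heads=None, dtype_bytes=2, tp_size=1):
    """Bytes needed to cache keys and values for one transformer layer."""
    # Multi-head attention uses one KV head per query head;
    # grouped-query attention uses fewer.
    kv_heads = n_head if num_key_value_heads is None else num_key_value_heads
    return 2 * batch_size * seq_len * kv_heads * head_dim * dtype_bytes / tp_size
```

With 32 heads of dimension 128, a 2048-token sequence costs 32 MiB per layer in FP16, and dropping to 8 KV heads cuts that to 8 MiB.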

get_memory_optimizer_state_and_gradient_embedding(master_weights_dtype_bytes=4, other_op_bytes=None, ds_zero=DSZeRO.NONE)[source]
Return type:

tuple

get_memory_optimizer_state_and_gradient_last_layernorm(master_weights_dtype_bytes=4, other_op_bytes=None, ds_zero=DSZeRO.NONE)[source]
Return type:

tuple

get_memory_optimizer_state_and_gradient_per_layer(master_weights_dtype_bytes=4, other_op_bytes=None, ds_zero=DSZeRO.NONE)[source]

Get the memory (in bytes) required to store the gradients and optimizer states of a transformer layer. The optimizer states include the master weights and other states such as momentum. The gradients need to be upcast to the same data type as the optimizer master weights before being applied.

The default assumes the Adam optimizer (https://arxiv.org/abs/1412.6980), which requires full-precision master weights (master_weights_dtype_bytes=4) plus momentum and variance (other_op_bytes=8). For other optimizers, use master_weights_dtype_bytes and other_op_bytes to express the bytes needed. For example, with the Lion optimizer (https://arxiv.org/abs/2302.06675), other_op_bytes = 4 as it only requires FP32 momentum.

With DeepSpeed ZeRO stage 1 and above, the optimizer states are sharded across data parallel groups. With ZeRO stage 2 and above, the gradients are sharded across the data parallel group. With FSDP SHARD_GRAD_OP or FULL_SHARD, the gradients and optimizer states are sharded across data parallel groups.

Parameters:
  • master_weights_dtype_bytes (int) – the number of bytes in the data type for the optimizer master weights. Defaults to BYTES_FP32.

  • other_op_bytes (int) – the number of bytes in the optimizer state. Defaults to None, which assumes using Adam optimizer.

  • ds_zero (DSZeRO, optional) – which DeepSpeed ZeRO stage to use. Defaults to DSZeRO.NONE (disabled, no sharding).

Returns:

a tuple of the memory (in bytes) required to store the optimizer states and gradients of a transformer layer

Return type:

tuple
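The byte accounting described above can be sketched as follows. The helper is hypothetical: it takes the parameter count directly, ignores tensor and pipeline parallelism, and assumes gradients are held in the same data type as the master weights.

```python
def optimizer_state_and_gradient_bytes(num_params, dp_size=1, zero_stage=0,
                                       master_weights_dtype_bytes=4,
                                       other_op_bytes=8):
    """Return (optimizer_state_bytes, gradient_bytes) per GPU."""
    # Master weights plus other optimizer states (Adam: momentum and variance).
    op_bytes = (master_weights_dtype_bytes + other_op_bytes) * num_params
    # Assumption: gradients are upcast to the master weight data type.
    grad_bytes = master_weights_dtype_bytes * num_params
    if zero_stage >= 1:
        op_bytes /= dp_size    # ZeRO stage 1+: shard optimizer states
    if zero_stage >= 2:
        grad_bytes /= dp_size  # ZeRO stage 2+: also shard gradients
    return op_bytes, grad_bytes
```

Passing other_op_bytes=4 models an optimizer such as Lion that keeps a single FP32 state per parameter.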

get_num_active_params_per_layer()[source]

Get the number of active parameters in a transformer layer, including the attention and MoE MLP linear layers.

Returns:

the number of parameters in a transformer layer

Return type:

int

get_num_active_params_total()[source]

Get the total number of active parameters in the model, including all the transformer layers and the embedding layer.

Returns:

the total number of parameters in the model

Return type:

int

get_num_flops_bwd_total(batch_size, seq_len)[source]

Get the number of floating point operations (flops) for the backward pass of the entire transformer, estimated as twice the number of flops for the forward pass. The count is model-specific and does not depend on the parallelism strategy.

Parameters:
  • batch_size (int) – batch size

  • seq_len (int) – sequence length

Returns:

the number of floating point operations for the backward pass of the entire transformer

Return type:

int

get_num_flops_fwd_per_layer(batch_size, seq_len)[source]

Get the number of floating point operations (flops) for the forward pass of a transformer layer, given the batch size and sequence length. The count is model-specific and does not depend on the parallelism strategy.

Parameters:
  • batch_size (int) – batch size

  • seq_len (int) – sequence length

Returns:

the number of floating point operations for the forward pass of a transformer layer

Return type:

int
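A widely used estimate for a standard transformer layer counts 2 flops per multiply-accumulate in each matrix multiply. The sketch below assumes multi-head attention, an MLP expansion ratio of 4, no gated linear units, and no MoE; the function names are hypothetical and the library may count differently for other architectures.

```python
def flops_fwd_per_layer(batch_size, seq_len, hidden_dim):
    """Forward flops for one transformer layer: 24*b*s*h^2 + 4*b*s^2*h."""
    b, s, h = batch_size, seq_len, hidden_dim
    # Q/K/V and output projections (8*b*s*h^2), plus QK^T and
    # attention over V (4*b*s^2*h).
    attn = 8 * b * s * h * h + 4 * b * s * s * h
    # Two linear layers with a 4h intermediate dimension.
    mlp = 16 * b * s * h * h
    return attn + mlp


def flops_bwd_per_layer(batch_size, seq_len, hidden_dim):
    """Backward flops, estimated as twice the forward flops."""
    return 2 * flops_fwd_per_layer(batch_size, seq_len, hidden_dim)
```

The factor of two for the backward pass comes from computing gradients with respect to both the inputs and the weights of each matrix multiply.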

get_num_flops_fwd_per_layer_attn(batch_size, seq_len)[source]

Get the number of floating point operations (flops) for the forward pass of the attention module in a transformer layer, given the batch size and sequence length. The count is model-specific and does not depend on the parallelism strategy.

Parameters:
  • batch_size (int) – batch size

  • seq_len (int) – sequence length

Returns:

the number of floating point operations for the forward pass of the attention module in a transformer layer

Return type:

int

get_num_flops_fwd_per_layer_mlp(batch_size, seq_len)[source]

Get the number of floating point operations (flops) for the forward pass of the MLP module in a transformer layer, given the batch size and sequence length. The count is model-specific and does not depend on the parallelism strategy.

Parameters:
  • batch_size (int) – batch size

  • seq_len (int) – sequence length

Returns:

the number of floating point operations for the forward pass of the MLP module in a transformer layer

Return type:

int

get_num_flops_fwd_total(batch_size, seq_len)[source]

Get the number of floating point operations (flops) for the forward pass of the entire transformer, given the batch size and sequence length. The count is model-specific and does not depend on the parallelism strategy.

Parameters:
  • batch_size (int) – batch size

  • seq_len (int) – sequence length

Returns:

the number of floating point operations for the forward pass of the entire transformer

Return type:

int

get_num_flops_total_attn_compute(batch_size, seq_len)[source]

Get the number of floating point operations (flops) for recomputation when selectively checkpointing the attention computation (QK^T matrix multiply, softmax, softmax dropout, and attention over V). The count is model-specific and does not depend on the parallelism strategy.

Parameters:
  • batch_size (int) – batch size

  • seq_len (int) – sequence length

Returns:

the number of floating point operations for recomputation when using selective activation recomputation

Return type:

int

get_num_params_embedding(shared_embedding=True)[source]

Get the number of parameters in the embedding layer.

Parameters:

shared_embedding (bool, optional) – whether the output embedding shares weights with the input embedding. Defaults to True.

Returns:

the number of parameters in the embedding layer

Return type:

int

get_num_params_last_layernorm()[source]
Return type:

int

get_num_params_per_layer()[source]

Get the number of parameters in a transformer layer, including the attention and MLP linear layers.

Returns:

the number of parameters in a transformer layer

Return type:

int
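As a rough cross-check, the linear-layer parameter count can be derived from the ModelConfig fields. This sketch is a hypothetical helper that ignores biases, layernorm weights, and MoE routers, and it assumes ffn_embed_dim defaults to 4 * hidden_dim and that gated linear units add a third MLP matrix.

```python
def params_per_layer(hidden_dim, n_head, num_key_value_heads=None,
                     ffn_embed_dim=None, mlp_gated_linear_units=False):
    """Weights in the attention and MLP linear layers of one transformer layer."""
    h = hidden_dim
    head_dim = h // n_head
    kv_heads = n_head if num_key_value_heads is None else num_key_value_heads
    ffn = 4 * h if ffn_embed_dim is None else ffn_embed_dim
    # Query and output projections are h x h; key and value projections
    # shrink with the number of KV heads.
    attn = 2 * h * h + 2 * h * kv_heads * head_dim
    # Up and down projections, plus a gate projection when gated units are used.
    mlp = (3 if mlp_gated_linear_units else 2) * h * ffn
    return attn + mlp
```

With the defaults this reduces to the familiar 12 * hidden_dim^2 per layer.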

get_num_params_per_layer_attn()[source]

Get the number of parameters in the attention linear layers, including the query/key/value projection and output matrices.

Returns:

the number of parameters in the attention linear layers

Return type:

int

get_num_params_per_layer_layernorm()[source]
Return type:

int

get_num_params_per_layer_mlp()[source]

Get the number of parameters in the MLP linear layers, including the intermediate and output matrices.

Returns:

the number of parameters in the two MLP linear layers

Return type:

int

get_num_params_per_layer_router()[source]
Return type:

int

get_num_params_total()[source]

Get the total number of parameters in the model, including all the transformer layers and the embedding layer.

Returns:

the total number of parameters in the model

Return type:

int

get_pivot()[source]

Return the pivot point, defined as (model_weights / hbm_bandwidth) / (model_flops / TFLOPS_per_gpu)

Returns:

pivot point

Return type:

float
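The ratio above compares the time to stream the weights from HBM with the time to execute the model's flops. A minimal sketch with a hypothetical helper; the units are an assumption (bytes, bytes/s, flops, and flop/s rather than TFLOPS), and what exactly counts toward model_flops is left to the caller.

```python
def pivot(model_weight_bytes, hbm_bandwidth_bytes_per_sec,
          model_flops, flops_per_sec):
    """(model_weights / hbm_bandwidth) / (model_flops / flops_per_gpu)."""
    memory_time = model_weight_bytes / hbm_bandwidth_bytes_per_sec
    compute_time = model_flops / flops_per_sec
    return memory_time / compute_time
```

Under this reading, a value above 1 means loading the weights takes longer than the computation itself, which suggests a memory-bound workload.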

get_readable_summary_dict(summary_dict, title='Summary')[source]
Return type:

str

get_weight_memory_last_layernorm(ds_zero=DSZeRO.NONE)[source]
get_weight_memory_per_layer(ds_zero=DSZeRO.NONE, return_breakdown=False)[source]

Get the memory (in bytes) required to store the weights of a transformer layer, given the number of parameters in a transformer layer, the data type used for the weights, the tensor parallelism size, and the DeepSpeed ZeRO stage. With ZeRO Stage 3, the weights are sharded across data parallel groups.

Parameters:

ds_zero (DSZeRO, optional) – which DeepSpeed ZeRO stage to use. Defaults to DSZeRO.NONE (disabled).

Returns:

the memory (in bytes) required to store the weights of a transformer layer, or a tuple of its breakdown

Return type:

Union[float, tuple]

inference(batch_size_per_gpu=1, seq_len=512, num_tokens_to_generate=32, use_kv_cache=True, ds_zero=DSZeRO.NONE, layernorm_dtype_bytes=2, kv_cache_dtype_bytes=None, cost_per_gpu_hour=None, output_dir=None, output_file_suffix='')[source]

Inference analysis given the configs and inputs.

Parameters:
  • batch_size_per_gpu (int, optional) – batch size per gpu. Defaults to 1.

  • seq_len (int, optional) – number of input tokens. Defaults to 512.

  • num_tokens_to_generate (int, optional) – number of tokens to generate for generative models. Defaults to 32.

  • use_kv_cache (bool, optional) – whether to use kv_cache. Defaults to True.

  • ds_zero (DSZeRO, optional) – which DeepSpeed ZeRO stage to use. Defaults to DSZeRO.NONE (disabled).

  • layernorm_dtype_bytes (int, optional) – number of bytes in the data type for the layernorm activations. Defaults to 2 (FP16). Often has to be at least FP16 in inference to maintain model accuracy.

  • kv_cache_dtype_bytes (int, optional) – number of bytes in the data type for the kv_cache. Defaults to None. Often has to be at least FP16 in inference to maintain model accuracy.

  • cost_per_gpu_hour (float, optional) – dollar cost per GPU hour. Defaults to None.

  • output_dir (str, optional) – if set to a directory path, write the return summary dict out to the directory with the setup. Defaults to None.

Returns:

a summary dict of the inference analysis

Return type:

dict

output_summary_dict(summary_dict, output_dir, print_human_readable=True, output_file_suffix='')[source]
print_config(name='Training Configs')[source]
Return type:

None

training(batch_size_per_gpu=None, gradient_accumulation_steps=None, global_batch_size=None, seq_len=None, total_num_tokens=None, activation_recomputation=ActivationRecomputation.NONE, ds_zero=DSZeRO.NONE, layernorm_dtype_bytes=4, master_weights_dtype_bytes=4, other_op_bytes=None, flash_attn=True, softmax_dropout=False, mlp_activation_quant_bits=None, mlp_1linear_quant_bits=None, mlp_gelu_input_quant_bits=None, mlp_2linear_quant_bits=None, mlp_recompute_gelu=False, output_dir=None, output_file_suffix='')[source]

Training analysis given the configs and inputs.

Parameters:
  • batch_size_per_gpu (int, optional) – batch size per gpu (micro batch size). Defaults to None.

  • gradient_accumulation_steps (int, optional) – gradient accumulation steps. Defaults to None.

  • global_batch_size (int, optional) – global batch size. Defaults to None.

  • seq_len (int, optional) – sequence length. Defaults to None.

  • total_num_tokens (int, optional) – total number of tokens used for training. Defaults to None.

  • activation_recomputation (ActivationRecomputation, optional) – activation recomputation strategy. Defaults to ActivationRecomputation.NONE.

  • ds_zero (DSZeRO, optional) – which DeepSpeed ZeRO stage to use. Defaults to DSZeRO.NONE (disabled).

  • layernorm_dtype_bytes (int, optional) – number of bytes in the data type for the layernorm activations. Defaults to BYTES_FP32. Often has to be FP32 in training to maintain model accuracy.

  • master_weights_dtype_bytes (int) – the number of bytes in the data type for the optimizer master weights. Defaults to BYTES_FP32.

  • other_op_bytes (int, optional) – the number of bytes in the optimizer state. Defaults to None, which assumes using Adam optimizer.

  • flash_attn (bool, optional) – whether to use Flash Attention. Defaults to True.

  • softmax_dropout (bool, optional) – whether to apply dropout after softmax. Defaults to False.

  • mlp_activation_quant_bits (int, optional) – number of bits to quantize MLP activations; if set, override the values for mlp_1linear_quant_bits, mlp_gelu_input_quant_bits and mlp_2linear_quant_bits. Defaults to None.

  • mlp_1linear_quant_bits (int, optional) – number of bits to quantize the input activations of the first linear layer. Defaults to None.

  • mlp_gelu_input_quant_bits (int, optional) – number of bits to quantize the GELU input activations. Defaults to None.

  • mlp_2linear_quant_bits (int, optional) – number of bits to quantize the input activations of the second linear layer. Defaults to None.

  • mlp_recompute_gelu (bool, optional) – whether to recompute the gelu activation in the MLP backward pass. Defaults to False.

  • output_dir (str, optional) – if set to a directory path, write the return summary dict out to the directory with the setup. Defaults to None.

Returns:

a summary dict of the training analysis

Return type:

dict

update_dtype_config(dtype_config)[source]
Return type:

None

update_float_efficiency(flops_efficiency)[source]
Return type:

None

update_gpu_config(gpu_config)[source]
Return type:

None

update_inter_node_memory_efficiency(inter_node_memory_efficiency)[source]
Return type:

None

update_intra_node_memory_efficiency(intra_node_memory_efficiency)[source]
Return type:

None

update_model_config(model_config)[source]
Return type:

None

update_parallelism_config(parallelism_config)[source]
Return type:

None

class llm_analysis.analysis.ModelConfig(name, num_layers, n_head, hidden_dim, vocab_size, max_seq_len=None, num_key_value_heads=None, num_key_value_groups=None, ffn_embed_dim=None, expansion_ratio=None, model_type=None, moe_num_experts=1, moe_top_k=1, mlp_gated_linear_units=False)[source]

Bases: object

expansion_ratio: float = None
ffn_embed_dim: int = None
hidden_dim: int
max_seq_len: int = None
mlp_gated_linear_units: bool = False
model_type: str = None
moe_num_experts: int = 1
moe_top_k: int = 1
n_head: int
name: str
num_key_value_groups: int = None
num_key_value_heads: int = None
num_layers: int
vocab_size: int
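
To show how the required and optional fields fit together, here is a minimal stand-in that mirrors a subset of the fields listed above. ModelConfigSketch, its two helper methods and the "toy-model" values are hypothetical. The rule that ffn_embed_dim falls back to hidden_dim times expansion_ratio (with 4 as the fallback ratio) is an assumption based on common transformer conventions, not confirmed behavior of ModelConfig.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelConfigSketch:
    """Stand-in mirroring a subset of the documented ModelConfig fields."""
    name: str
    num_layers: int
    n_head: int
    hidden_dim: int
    vocab_size: int
    max_seq_len: Optional[int] = None
    num_key_value_heads: Optional[int] = None
    ffn_embed_dim: Optional[int] = None
    expansion_ratio: Optional[float] = None

    def head_dim(self) -> int:
        # Per-head dimension when hidden_dim is split evenly across heads.
        return self.hidden_dim // self.n_head

    def resolved_ffn_embed_dim(self) -> int:
        # Assumption: an explicit ffn_embed_dim wins; otherwise scale
        # hidden_dim by expansion_ratio, falling back to a factor of 4.
        if self.ffn_embed_dim is not None:
            return self.ffn_embed_dim
        ratio = 4 if self.expansion_ratio is None else self.expansion_ratio
        return int(self.hidden_dim * ratio)


# Made-up model dimensions, used only to exercise the helpers.
toy = ModelConfigSketch(name="toy-model", num_layers=12, n_head=16,
                        hidden_dim=1024, vocab_size=32000)
print(toy.head_dim(), toy.resolved_ffn_embed_dim())
```
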
class llm_analysis.analysis.ParallelismConfig(tp_size=1, pp_size=1, dp_size=1, ep_size=1, sp_size=None)[source]

Bases: object

dp_size: int = 1
ep_size: int = 1
pp_size: int = 1
sp_size: int = None
tp_size: int = 1
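
The sketch below illustrates how these sizes are commonly combined. ParallelismSketch and its methods are hypothetical stand-ins. The GPU count as the product tp_size * pp_size * dp_size is the usual 3D-parallelism convention and is an assumption here; the sp_size fallback follows the parameter descriptions of infer and train, which state that sp_size defaults to tp_size.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParallelismSketch:
    """Stand-in mirroring the documented ParallelismConfig fields."""
    tp_size: int = 1
    pp_size: int = 1
    dp_size: int = 1
    ep_size: int = 1
    sp_size: Optional[int] = None

    def effective_sp_size(self) -> int:
        # sp_size left as None falls back to tp_size.
        return self.tp_size if self.sp_size is None else self.sp_size

    def num_gpus(self) -> int:
        # Assumption: tensor, pipeline and data parallel groups multiply.
        return self.tp_size * self.pp_size * self.dp_size


layout = ParallelismSketch(tp_size=4, pp_size=2, dp_size=8)
print(layout.num_gpus(), layout.effective_sp_size())
```
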
llm_analysis.analysis.get_dtype_config_by_name(name)[source]

Get data type config from the populated mapping by name.

Return type:

DtypeConfig

llm_analysis.analysis.get_gpu_config_by_name(name)[source]

Get GPU config from the populated mapping by name.

Return type:

GPUConfig

llm_analysis.analysis.get_model_config_by_name(name_or_path)[source]

Get the model config by name from the populated mapping or, failing that, from a model config JSON file path. If neither is found, try to get it from Hugging Face.

Return type:

ModelConfig
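
The lookup order described above (mapping, then JSON file, then Hugging Face) can be sketched as a simple fallback chain. This is not the library's code: resolve_model_config, known_configs and fetch_remote are hypothetical names, and the remote step is passed in as a callable so the sketch runs without network access.

```python
import json
import os


def resolve_model_config(name_or_path, known_configs, fetch_remote):
    """Hypothetical helper: try the mapping, then a JSON file, then a remote source."""
    # 1. Pre-populated name-to-config mapping.
    if name_or_path in known_configs:
        return known_configs[name_or_path]
    # 2. Path to a model config JSON file on disk.
    if os.path.isfile(name_or_path):
        with open(name_or_path) as f:
            return json.load(f)
    # 3. Last resort: ask a remote source (stands in for Hugging Face).
    return fetch_remote(name_or_path)


known = {"toy-model": {"name": "toy-model", "num_layers": 2}}
remote = lambda name: {"name": name, "source": "remote"}

print(resolve_model_config("toy-model", known, remote))
print(resolve_model_config("some-org/unlisted-model", known, remote))
```
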

llm_analysis.analysis.infer(model_name='facebook_opt-1.3b', gpu_name='a100-sxm-40gb', dtype_name='w16a16e16', log_level='INFO', batch_size_per_gpu=1, ds_zero=0, dp_size=1, tp_size=1, pp_size=1, sp_size=None, seq_len=512, num_tokens_to_generate=32, use_kv_cache=True, layernorm_dtype_bytes=2, kv_cache_dtype_bytes=None, achieved_tflops=None, achieved_memory_bandwidth_GBs=None, flops_efficiency=None, hbm_memory_efficiency=None, intra_node_memory_efficiency=1.0, inter_node_memory_efficiency=1.0, cost_per_gpu_hour=None, output_dir=None, output_file_suffix='')[source]

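Entry point function of inference analysis. This uses the pre-defined name-to-configuration mappings and the common arguments described below.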

Parameters:
  • model_name (str, optional) – model name to query the pre-defined model_configs dict, or a model config JSON file path; if neither is found, Hugging Face is queried to construct the ModelConfig. Defaults to “facebook_opt-1.3b”.

  • gpu_name (str, optional) – GPU name to query the pre-defined gpu_configs dict. Defaults to “a100-sxm-40gb”.

  • dtype_name (str, optional) – data type name to query the pre-defined dtype_configs dict. Defaults to “w16a16e16”.

  • log_level (str, optional) – logging level. Defaults to “INFO”.

  • batch_size_per_gpu (int, optional) – batch size per GPU. Defaults to 1.

  • ds_zero (int, optional) – which DeepSpeed ZeRO stage to use. See DSZeRO. Defaults to 0.

  • dp_size (int, optional) – data parallelism size. Defaults to 1.

  • tp_size (int, optional) – tensor parallelism size. Defaults to 1.

  • pp_size (int, optional) – pipeline parallelism size. Defaults to 1.

  • sp_size (int, optional) – sequence parallelism size. Defaults to None, in which case tp_size is used.

  • seq_len (int, optional) – input sequence length. Defaults to 512.

  • num_tokens_to_generate (int, optional) – number of tokens to generate for generative models. Defaults to 32.

  • use_kv_cache (bool, optional) – whether to use kv cache. Defaults to True.

  • layernorm_dtype_bytes (int, optional) – number of bytes in the data type for the layernorm activations. Defaults to 2 (FP16). Often has to be at least FP16 in inference to maintain model accuracy.

  • kv_cache_dtype_bytes (int, optional) – number of bytes in the data type for the kv_cache. Defaults to None. Often has to be at least FP16 in inference to maintain model accuracy.

  • achieved_tflops (float, optional) – achieved TFLOPS per GPU. If specified, will override the flops_efficiency passed in. Defaults to None.

  • achieved_memory_bandwidth_GBs (float, optional) – achieved GPU memory bandwidth in GB/s. If specified, will override the hbm_memory_efficiency passed in. Defaults to None.

  • flops_efficiency (float, optional) – flops efficiency, ranging from 0 to 1. Defaults to None.

  • hbm_memory_efficiency (float, optional) – GPU HBM memory efficiency, ranging from 0 to 1. Defaults to HBM_MEMORY_EFFICIENCY.

  • intra_node_memory_efficiency (float, optional) – intra-node memory efficiency, ranging from 0 to 1. Defaults to INTRA_NODE_MEMORY_EFFICIENCY.

  • inter_node_memory_efficiency (float, optional) – inter-node memory efficiency, ranging from 0 to 1. Defaults to INTER_NODE_MEMORY_EFFICIENCY.

  • cost_per_gpu_hour (float, optional) – dollar cost per GPU hour. Defaults to None.

  • output_dir (str, optional) – if set to a directory path, write the return summary dict out to the directory with the setup. Defaults to None.

  • output_file_suffix (str, optional) – suffix of the output file. Defaults to “”.

Returns:

a summary dictionary of the inference analysis

Return type:

dict
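
The override rule stated for achieved_tflops and achieved_memory_bandwidth_GBs (an achieved value, if specified, takes precedence over the matching efficiency factor) can be sketched as follows. resolve_achieved is a hypothetical helper and the peak numbers are made-up inputs, not values taken from any GPUConfig.

```python
def resolve_achieved(peak, efficiency, achieved=None):
    """Hypothetical helper: an explicit achieved value overrides peak * efficiency."""
    if achieved is not None:
        return achieved
    if not 0 <= efficiency <= 1:
        raise ValueError("efficiency must be between 0 and 1")
    return peak * efficiency


# Made-up peak compute (TFLOPS) and memory bandwidth (GB/s) for illustration.
print(resolve_achieved(peak=300.0, efficiency=0.5))
print(resolve_achieved(peak=300.0, efficiency=0.5, achieved=120.0))
print(resolve_achieved(peak=1500.0, efficiency=0.8))
```
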

llm_analysis.analysis.pformat(object, indent=1, width=80, depth=None, *, compact=False, sort_dicts=True)[source]

Format a Python object into a pretty-printed representation.

llm_analysis.analysis.total_ordering(cls)[source]

Class decorator that fills in missing ordering methods

llm_analysis.analysis.train(model_name='facebook_opt-1.3b', gpu_name='a100-sxm-40gb', dtype_name='w16a16e16', log_level='INFO', batch_size_per_gpu=None, gradient_accumulation_steps=None, global_batch_size=None, seq_len=None, total_num_tokens=None, activation_recomputation=0, ds_zero=0, dp_size=None, tp_size=1, pp_size=1, sp_size=None, ep_size=1, total_num_gpus=None, layernorm_dtype_bytes=4, master_weights_dtype_bytes=4, other_op_bytes=None, flash_attn=True, softmax_dropout=False, mlp_activation_quant_bits=None, mlp_1linear_quant_bits=None, mlp_gelu_input_quant_bits=None, mlp_2linear_quant_bits=None, mlp_recompute_gelu=False, achieved_tflops=None, flops_efficiency=None, hbm_memory_efficiency=1, intra_node_memory_efficiency=1.0, inter_node_memory_efficiency=1.0, num_gpus_per_node=8, output_dir=None, output_file_suffix='')[source]

Entry point function of training analysis for the command line interface. This uses pre-defined name-to-configuration mapping and common arguments to construct LLMAnalysis.

Parameters:
  • model_name (str, optional) – model name to query the pre-defined model_configs dict, or a model config JSON file path; if neither is found, Hugging Face is queried to construct the ModelConfig. Defaults to “facebook_opt-1.3b”.

  • gpu_name (str, optional) – GPU name to query the pre-defined gpu_configs dict. Defaults to “a100-sxm-40gb”.

  • dtype_name (str, optional) – data type name to query the pre-defined dtype_configs dict. Defaults to “w16a16e16”.

  • log_level (str, optional) – logging level. Defaults to “INFO”.

  • batch_size_per_gpu (int, optional) – batch size per GPU (micro batch size). Defaults to None.

  • gradient_accumulation_steps (int, optional) – gradient accumulation steps. Defaults to None.

  • global_batch_size (int, optional) – global batch size. Defaults to None.

  • seq_len (int, optional) – sequence length. Defaults to None.

  • total_num_tokens (int, optional) – total number of tokens used for training. Defaults to None.

  • activation_recomputation (int, optional) – activation recomputation strategy. See ActivationRecomputation. Defaults to 0.

  • ds_zero (int, optional) – which DeepSpeed ZeRO stage to use. See DSZeRO. Defaults to 0.

  • dp_size (int, optional) – data parallelism size. Defaults to None.

  • tp_size (int, optional) – tensor parallelism size. Defaults to 1.

  • pp_size (int, optional) – pipeline parallelism size. Defaults to 1.

  • sp_size (int, optional) – sequence parallelism size. Defaults to None, in which case tp_size is used.

  • ep_size (int, optional) – expert parallelism size. Defaults to 1.

  • total_num_gpus (int, optional) – total number of GPUs used for training. Defaults to None.

  • layernorm_dtype_bytes (int, optional) – number of bytes in the data type for the layernorm activations. Often has to be FP32 in training to maintain model accuracy. Defaults to BYTES_FP32.

  • master_weights_dtype_bytes (int) – the number of bytes in the data type for the optimizer master weights. Defaults to BYTES_FP32.

  • other_op_bytes (int, optional) – the number of bytes in the optimizer state. Defaults to None, which assumes the Adam optimizer.

  • flash_attn (bool, optional) – whether to use Flash Attention. Defaults to True.

  • softmax_dropout (bool, optional) – whether to apply dropout after softmax. Defaults to False.

  • mlp_activation_quant_bits (int, optional) – number of bits to quantize MLP activations; if set, overrides the values of mlp_1linear_quant_bits, mlp_gelu_input_quant_bits and mlp_2linear_quant_bits. Defaults to None.

  • mlp_1linear_quant_bits (int, optional) – number of bits to quantize the input activations of the first linear layer. Defaults to None.

  • mlp_gelu_input_quant_bits (int, optional) – number of bits to quantize the GELU input activations. Defaults to None.

  • mlp_2linear_quant_bits (int, optional) – number of bits to quantize the input activations of the second linear layer. Defaults to None.

  • mlp_recompute_gelu (bool, optional) – whether to recompute the GELU activation in the MLP backward pass. Defaults to False.

  • achieved_tflops (float, optional) – achieved TFLOPS per GPU. Defaults to None.

  • flops_efficiency (float, optional) – flops efficiency, ranging from 0 to 1. Defaults to None.

  • hbm_memory_efficiency (float, optional) – GPU HBM memory efficiency, ranging from 0 to 1. Defaults to HBM_MEMORY_EFFICIENCY.

  • intra_node_memory_efficiency (float, optional) – intra-node memory efficiency, ranging from 0 to 1. Defaults to INTRA_NODE_MEMORY_EFFICIENCY.

  • inter_node_memory_efficiency (float, optional) – inter-node memory efficiency, ranging from 0 to 1. Defaults to INTER_NODE_MEMORY_EFFICIENCY.

  • num_gpus_per_node (int, optional) – number of GPUs per node. Defaults to NUM_GPUS_PER_NODE (8).

  • output_dir (str, optional) – if set to a directory path, write the return summary dict out to the directory with the setup. Defaults to None.

  • output_file_suffix (str, optional) – suffix of the output file. Defaults to “”.

Returns:

a summary dictionary of the training analysis

Return type:

dict
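
Since batch_size_per_gpu, gradient_accumulation_steps and global_batch_size all default to None, it helps to see how the three are usually related. The sketch below assumes the standard relation global_batch_size = batch_size_per_gpu * gradient_accumulation_steps * dp_size and fills in whichever value is missing. resolve_batch_sizes is a hypothetical helper, and whether train applies exactly this rule is an assumption.

```python
def resolve_batch_sizes(dp_size, batch_size_per_gpu=None,
                        gradient_accumulation_steps=None,
                        global_batch_size=None):
    """Hypothetical helper: derive the one missing batch setting.

    Assumes global = per_gpu * grad_accum_steps * dp_size.
    """
    given = [batch_size_per_gpu, gradient_accumulation_steps, global_batch_size]
    if given.count(None) != 1:
        raise ValueError("provide exactly two of the three batch settings")

    if global_batch_size is None:
        global_batch_size = (batch_size_per_gpu
                             * gradient_accumulation_steps * dp_size)
    elif gradient_accumulation_steps is None:
        per_step = batch_size_per_gpu * dp_size
        if global_batch_size % per_step:
            raise ValueError("global_batch_size is not divisible")
        gradient_accumulation_steps = global_batch_size // per_step
    else:
        per_step = gradient_accumulation_steps * dp_size
        if global_batch_size % per_step:
            raise ValueError("global_batch_size is not divisible")
        batch_size_per_gpu = global_batch_size // per_step

    return batch_size_per_gpu, gradient_accumulation_steps, global_batch_size


print(resolve_batch_sizes(dp_size=8, batch_size_per_gpu=4,
                          gradient_accumulation_steps=2))
```
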

llm_analysis.analysis.within_range(val, target, tolerance)[source]