Bitsandbytes documentation

SGD

You are viewing main version, which requires installation from source. If you'd like regular pip install, checkout the latest stable version (v0.45.0).
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

SGD

Stochastic gradient descent (SGD) is a basic gradient descent optimizer to minimize loss given a set of model parameters and updates the parameters in the opposite direction of the gradient. The update is performed on a randomly sampled mini-batch of data from the dataset.

bitsandbytes also supports momentum and Nesterov momentum to accelerate SGD by adding a weighted average of past gradients to the current gradient.

SGD

class bitsandbytes.optim.SGD

< >

( params lr momentum = 0 dampening = 0 weight_decay = 0 nesterov = False optim_bits = 32 args = None min_8bit_size = 4096 percentile_clipping = 100 block_wise = True )

__init__

< >

( params lr momentum = 0 dampening = 0 weight_decay = 0 nesterov = False optim_bits = 32 args = None min_8bit_size = 4096 percentile_clipping = 100 block_wise = True )

Parameters

  • params (torch.tensor) — The input parameters to optimize.
  • lr (float) — The learning rate.
  • momentum (float, defaults to 0) — The momentum value speeds up the optimizer by taking bigger steps.
  • dampening (float, defaults to 0) — The dampening value reduces the momentum of the optimizer.
  • weight_decay (float, defaults to 0.0) — The weight decay value for the optimizer.
  • nesterov (bool, defaults to False) — Whether to use Nesterov momentum.
  • optim_bits (int, defaults to 32) — The number of bits of the optimizer state.
  • args (object, defaults to None) — An object with additional arguments.
  • min_8bit_size (int, defaults to 4096) — The minimum number of elements of the parameter tensors for 8-bit optimization.
  • percentile_clipping (int, defaults to 100) — Adapts clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a certain percentile to improve stability.
  • block_wise (bool, defaults to True) — Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.

Base SGD optimizer.

SGD8bit

class bitsandbytes.optim.SGD8bit

< >

( params lr momentum = 0 dampening = 0 weight_decay = 0 nesterov = False args = None min_8bit_size = 4096 percentile_clipping = 100 block_wise = True )

__init__

< >

( params lr momentum = 0 dampening = 0 weight_decay = 0 nesterov = False args = None min_8bit_size = 4096 percentile_clipping = 100 block_wise = True )

Parameters

  • params (torch.tensor) — The input parameters to optimize.
  • lr (float) — The learning rate.
  • momentum (float, defaults to 0) — The momentum value speeds up the optimizer by taking bigger steps.
  • dampening (float, defaults to 0) — The dampening value reduces the momentum of the optimizer.
  • weight_decay (float, defaults to 0.0) — The weight decay value for the optimizer.
  • nesterov (bool, defaults to False) — Whether to use Nesterov momentum.
  • args (object, defaults to None) — An object with additional arguments.
  • min_8bit_size (int, defaults to 4096) — The minimum number of elements of the parameter tensors for 8-bit optimization.
  • percentile_clipping (int, defaults to 100) — Adapts clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a certain percentile to improve stability.
  • block_wise (bool, defaults to True) — Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.

8-bit SGD optimizer.

SGD32bit

class bitsandbytes.optim.SGD32bit

< >

( params lr momentum = 0 dampening = 0 weight_decay = 0 nesterov = False args = None min_8bit_size = 4096 percentile_clipping = 100 block_wise = True )

__init__

< >

( params lr momentum = 0 dampening = 0 weight_decay = 0 nesterov = False args = None min_8bit_size = 4096 percentile_clipping = 100 block_wise = True )

Parameters

  • params (torch.tensor) — The input parameters to optimize.
  • lr (float) — The learning rate.
  • momentum (float, defaults to 0) — The momentum value speeds up the optimizer by taking bigger steps.
  • dampening (float, defaults to 0) — The dampening value reduces the momentum of the optimizer.
  • weight_decay (float, defaults to 0.0) — The weight decay value for the optimizer.
  • nesterov (bool, defaults to False) — Whether to use Nesterov momentum.
  • args (object, defaults to None) — An object with additional arguments.
  • min_8bit_size (int, defaults to 4096) — The minimum number of elements of the parameter tensors for 8-bit optimization.
  • percentile_clipping (int, defaults to 100) — Adapts clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a certain percentile to improve stability.
  • block_wise (bool, defaults to True) — Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.

32-bit SGD optimizer.

< > Update on GitHub