jdit.parallel

SupParallelTrainer

class jdit.parallel.SupParallelTrainer(unfixed_params_list: list, train_func=None)[source]

Train tasks in parallel.

Parameters:
  • default_params – a dict of shared default parameters, like {'param_1': d1, 'param_2': d2, ...}.
  • unfixed_params_list – a list of per-task parameter dicts, like [{'param_1': a1, 'param_2': a2}, {'param_1': b1, 'param_2': b2}, ...].
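
A minimal sketch of how these two arguments can be combined into per-task parameter dicts. The dict-override shown here is an illustrative assumption, not necessarily the library's exact merge logic:

# Illustrative only: assume each dict in unfixed_params_list overrides the shared defaults.
default_params = {'batch_shape': (64, 1, 32, 32), 'nepochs': 10}
unfixed_params_list = [
    {'task_id': 1, 'lr': 1e-3, 'gpu_ids_abs': [0]},
    {'task_id': 2, 'lr': 1e-5, 'gpu_ids_abs': [2, 3]}]
per_task_params = [{**default_params, **unfixed} for unfixed in unfixed_params_list]
# per_task_params[0] == {'batch_shape': (64, 1, 32, 32), 'nepochs': 10,
#                        'task_id': 1, 'lr': 1e-3, 'gpu_ids_abs': [0]}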

Note

You must set task_id and gpu_ids_abs, either in default_params or in unfixed_params_list.

For example: {'task_id': 1}, {'gpu_ids_abs': [0, 1]}.

  • Tasks with the same task_id are executed sequentially on the given devices.
  • Tasks with different task_id values are executed in parallel on their respective devices.

Example:

unfixed_params_list = [
    {'task_id': 1, 'lr': 1e-3, 'gpu_ids_abs': [0]},
    {'task_id': 1, 'lr': 1e-4, 'gpu_ids_abs': [0]},
    {'task_id': 2, 'lr': 1e-5, 'gpu_ids_abs': [2, 3]}]

This unfixed_params_list produces the following schedule:

| time | 'task_id': 1                   | 'task_id': 2                      |
| t    | 'lr': 1e-3, 'gpu_ids_abs': [0] | 'lr': 1e-5, 'gpu_ids_abs': [2, 3] |
| t+1  | 'lr': 1e-4, 'gpu_ids_abs': [0] |                                   |

Tasks in the same row are executed in parallel; tasks with the same task_id (same column) are executed sequentially.
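
A minimal usage sketch is shown below. MyParallelTrainer is a hypothetical subclass name, and build_task_trainer is documented in the next section:

from jdit.parallel import SupParallelTrainer

class MyParallelTrainer(SupParallelTrainer):
    def build_task_trainer(self, unfixed_params):
        # Build and return a Trainer for one task (see build_task_trainer below).
        ...

unfixed_params_list = [
    {'task_id': 1, 'lr': 1e-3, 'gpu_ids_abs': [0]},
    {'task_id': 1, 'lr': 1e-4, 'gpu_ids_abs': [0]},
    {'task_id': 2, 'lr': 1e-5, 'gpu_ids_abs': [2, 3]}]

parallel_trainer = MyParallelTrainer(unfixed_params_list)
parallel_trainer.train(max_processes=4)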
build_task_trainer(unfixed_params: dict)[source]

You need to write this method to build your own Trainer.

This method runs in its own subprocess. The keys of params correspond to the arguments of the dataset, Model, Optimizer and Trainer; you can see the parameters in the following example.

These two parameters are special.

  • params["logdir"] controls the log directory.
  • params["gpu_ids_abs"] controls the running devices.

You should return a Trainer when you finish building.

Parameters: unfixed_params – the parameter dictionary for one task (referred to as params in the example below).
Returns: a Trainer instance.

Example:

# Use params['key'] values to build your Trainer.
# (FashionMNIST, Model, Optimizer, Tresnet18 and FashingClassTrainer are
# assumed to be imported elsewhere, e.g. from jdit and your own modules.)
logdir = params["logdir"]  # necessary!
gpu_ids_abs = params["gpu_ids_abs"]  # necessary!
use_benchmark = params["use_benchmark"]
data_root = params["data_root"]
batch_shape = params["batch_shape"]
opt_name = params["opt_name"]
lr = params["lr"]
lr_decay = params["lr_decay"]
lr_minimum = params["lr_minimum"]
weight_decay = params["weight_decay"]
momentum = params["momentum"]
betas = params["betas"]
init_method = params["init_method"]
depth = params["depth"]
mid_channels = params["mid_channels"]
nepochs = params["nepochs"]

torch.backends.cudnn.benchmark = use_benchmark
mnist = FashionMNIST(root=data_root, batch_shape=batch_shape)
T_net = Model(Tresnet18(depth=depth, mid_channels=mid_channels), gpu_ids_abs=gpu_ids_abs,
              init_method=init_method)
opt = Optimizer(T_net.parameters(), lr, lr_decay, weight_decay, momentum, betas, opt_name,
                lr_minimum=lr_minimum)
trainer = FashingClassTrainer(logdir, nepochs, gpu_ids_abs, T_net, opt, mnist)
# You must return a Trainer!
return trainer
error(msg)[source]

Called when a subprocess fails.

You can override this method for your own purposes.

Parameters: msg – error message.

finish(msg)[source]

Called when a subprocess finishes.

You can override this method for your own purposes.

Parameters: msg – finish message.
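
A minimal sketch of overriding both callbacks, assuming msg is a plain string; the file logging is purely illustrative:

from jdit.parallel import SupParallelTrainer

class MyParallelTrainer(SupParallelTrainer):
    def error(self, msg):
        # Record which subprocess failed and why.
        with open("parallel_tasks.log", "a") as f:
            f.write("FAILED: %s\n" % msg)

    def finish(self, msg):
        # Record that a subprocess finished normally.
        with open("parallel_tasks.log", "a") as f:
            f.write("FINISHED: %s\n" % msg)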

train(max_processes=4)[source]

Start the parallel tasks.

Start the parallel tasks that were saved in the self.parallel_plans dictionary.

Parameters: max_processes – the maximum number of worker processes, used to set Pool(processes=max_processes).
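
For example, reusing parallel_trainer from the usage sketch above, and assuming each task_id group is submitted as one pool job (an assumption about the scheduling, not a documented guarantee):

# With max_processes=2, at most two task_id groups train at the same time;
# any further groups wait until a worker in the Pool becomes free.
parallel_trainer.train(max_processes=2)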