jdit.parallel

SupParallelTrainer
class jdit.parallel.SupParallelTrainer(unfixed_params_list: list, train_func=None)[source]

Train in parallel.

Parameters:
- default_params – a dict(), like {param_1: d1, param_2: d2, ...}
- unfixed_params_list – a list, like [{param_1: a1, param_2: a2}, {param_1: b1, param_2: b2}, ...]
Note

You must set the values of task_id and gpu_ids_abs, whether in default_params or in unfixed_params_list, e.g. {'task_id': 1} and {'gpu_ids_abs': [0, 1]}.

- Tasks with the same task_id are executed sequentially on the given devices.
- Tasks with different task_id values are executed in parallel on the given devices.
Example:

    unfixed_params_list = [
        {'task_id': 1, 'lr': 1e-3, 'gpu_ids_abs': [0]},
        {'task_id': 1, 'lr': 1e-4, 'gpu_ids_abs': [0]},
        {'task_id': 2, 'lr': 1e-5, 'gpu_ids_abs': [2, 3]}]
This unfixed_params_list means:

    time    'task_id': 1                        'task_id': 2
    t       'lr': 1e-3, 'gpu_ids_abs': [0]      'lr': 1e-5, 'gpu_ids_abs': [2, 3]    executed in parallel
    t+1     'lr': 1e-4, 'gpu_ids_abs': [0]                                           executed sequentially
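To make the scheduling rule concrete, here is a small plain-Python sketch (illustrative only, not jdit internals) that groups the example list by task_id the way the table describes:

    from collections import OrderedDict

    def group_by_task_id(unfixed_params_list):
        # Params sharing a task_id form one sequential queue; distinct
        # task_ids become independent queues that run in parallel.
        queues = OrderedDict()
        for params in unfixed_params_list:
            queues.setdefault(params['task_id'], []).append(params)
        return queues

    unfixed_params_list = [
        {'task_id': 1, 'lr': 1e-3, 'gpu_ids_abs': [0]},
        {'task_id': 1, 'lr': 1e-4, 'gpu_ids_abs': [0]},
        {'task_id': 2, 'lr': 1e-5, 'gpu_ids_abs': [2, 3]}]

    for task_id, queue in group_by_task_id(unfixed_params_list).items():
        print(task_id, [p['lr'] for p in queue])
    # 1 [0.001, 0.0001]   -> run one after another on GPU 0
    # 2 [1e-05]           -> runs at the same time as queue 1, on GPUs 2 and 3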
build_task_trainer(unfixed_params: dict)[source]

You need to override this method to build your own Trainer. It will run in its own subprocess.

The keys of params are compatible with dataset, Model, Optimizer and Trainer; you can see the parameters in the following example. Two of them are special: params["logdir"] controls the log directory, and params["gpu_ids_abs"] controls the running devices.

You should return a Trainer when you finish building.

Parameters: params – parameter dictionary.
Returns: Trainer

Example:
    # Using ``params['key']`` to build your Trainer.
    logdir = params["logdir"]            # necessary!
    gpu_ids_abs = params["gpu_ids_abs"]  # necessary!
    use_benchmark = params["use_benchmark"]
    data_root = params["data_root"]
    batch_shape = params["batch_shape"]
    opt_name = params["opt_name"]
    lr = params["lr"]
    lr_decay = params["lr_decay"]
    lr_minimum = params["lr_minimum"]
    weight_decay = params["weight_decay"]
    momentum = params["momentum"]
    betas = params["betas"]
    init_method = params["init_method"]
    depth = params["depth"]
    mid_channels = params["mid_channels"]
    nepochs = params["nepochs"]

    torch.backends.cudnn.benchmark = use_benchmark
    mnist = FashionMNIST(root=data_root, batch_shape=batch_shape)
    T_net = Model(Tresnet18(depth=depth, mid_channels=mid_channels),
                  gpu_ids_abs=gpu_ids_abs, init_method=init_method)
    opt = Optimizer(T_net.parameters(), lr, lr_decay, weight_decay, momentum,
                    betas, opt_name, lr_minimum=lr_minimum)
    trainer = FashingClassTrainer(logdir, nepochs, gpu_ids_abs, T_net, opt, mnist)
    # You must return a Trainer!
    return trainer
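For context, a minimal sketch of wiring this together. Two assumptions here are not confirmed on this page: that the build function can be passed directly as train_func, and that the entry point is named train(). Each dict must also supply every key that build_task_trainer reads (use_benchmark, data_root, batch_shape, and so on, as in the example above).

    unfixed_params_list = [
        # ... plus the remaining keys read in build_task_trainer
        {'task_id': 1, 'gpu_ids_abs': [0], 'logdir': 'log/task1_lr1e-3', 'lr': 1e-3},
        {'task_id': 1, 'gpu_ids_abs': [0], 'logdir': 'log/task1_lr1e-4', 'lr': 1e-4},
        {'task_id': 2, 'gpu_ids_abs': [2, 3], 'logdir': 'log/task2_lr1e-5', 'lr': 1e-5}]
    pt = SupParallelTrainer(unfixed_params_list, train_func=build_task_trainer)
    pt.train()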
error(msg)[source]

Called when a subprocess fails. You can override this method for your own purposes.

Parameters: msg – error message
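For instance, an override might record failures to a file. This sketch is purely illustrative; the subclass and file name are hypothetical:

    class MyParallelTrainer(SupParallelTrainer):
        def build_task_trainer(self, params):
            ...  # build and return a Trainer, as in the example above

        def error(self, msg):
            # Illustrative override: append each subprocess failure to a
            # log file instead of relying on the default handling.
            with open('parallel_errors.log', 'a') as f:
                f.write('subprocess failed: %s\n' % msg)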