set_tf_config_berzelius.py
import os
import json
import csv
def set_tf_config():
"""
This functions sets the necessary values in the TF_CONFIG environment variable. It contains information on the
cluster architectures, i.e, which workers are allocated for the job, and the worker for the current task. This has
been specifically been developed for using the SLURM task manager, and kind of hardcoded for the ouput given using
the Berzelius supercomputer at NSC. It may be the case that this funciton does not work on other clusters without
some changes.
Here, the outputted string s from the call to os.environ["SLURM_JOB_NODELIST"], contains all the allocated
workers for the job.
Examples:
s = "Node021"
s = "Node[036-039]"
s = "Node[009-012,047]"
We need to translate this format into a separated list of strings representing the nodes to be able to
describe the cluster in a way that tf.distribute.MultiWorkerMirroredStrategy can interpret the cluster. That
is we want:
s = "Node021" -> ["Node021"]
s =" Node[036-039]" -> ["Node036", "Node037", "Node038", "Node039"]
s = "Node[009-012,047]" -> ["Node009","Node010","Node011","Node012", "Node047"]
This is what is done below.
An example for the case s = Node[009-012,047] is followed within the comments.
UPDATE: Some experiments indicate that I would want to have one process per gpu, instead of one process per node.
I can't imagine that I would need to make drastic changes, but this script needs to be updated according to this.
Probably the easiest case to look at is when we have the same number of GPUS on each node, but for a more general
implementation, we should be able to have for example 8 on one, and just 1 on another.
This can be handled as is, but want it to work for the case with 1 process per GPU.
Start by at least assuming that we have the same number of gpus per node, I would have to make something special in the sbatch call if I would
want to have say 7+5 gpus. Should possibly work in future.
"""
    s = os.environ["SLURM_JOB_NODELIST"]  # example: s = "Node[009-012,047]"
    #print(os.environ["CUDA_VISIBLE_DEVICES"])
    #os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # os.environ["SLURM_LOCALID"]
    #print("GPUS : ", os.environ["CUDA_VISIBLE_DEVICES"])
    if s.find("[") == -1:  # The case with only one node has no brackets; find the node number directly.
        s4 = [str(s[s.find("e") + 1:])]
    else:
        s2 = s[s.find("[") + 1: s.find("]")]  # s2 = "009-012,047"
        s3 = s2.split(",")                    # s3 = ["009-012", "047"]
        s4 = []
        for i in s3:
            j = i.find("-")
            if j != -1:
                # Expand a range such as "009-012" into "9", "10", "11", "12".
                s5 = i.split("-")
                a = int(s5[0])
                while a < int(s5[1]) + 1:
                    s4.append(str(a))
                    a += 1
            else:
                s4.append(i)
        # s4 = ["9", "10", "11", "12", "047"]
    #print(s4)
    # The node numbering uses three digits, padded with zeros if necessary.
    number_of_zeros = [3 - len(i) for i in s4]
    clust = ["node" + "0" * i[0] + i[1] for i in zip(number_of_zeros, s4)]  # clust = ["node009", "node010", "node011", "node012", "node047"]
    # All of the above should hold in most cases.
    # Now we need to know how many tasks we have. The env variables SLURM_TASKS_PER_NODE and SLURM_LOCALID may
    # be useful here. I assume that different processes need to open different ports.
    # Old version: use 888 as a fixed base, so the processes get ports 8880, 8881, 8882, etc.:
    # port = "888"
    port = "8" + os.environ["SLURM_JOBID"][-3:]  # New version: the base port is "8" followed by the last three
    # digits of the job ID; the local task index is appended below, giving ports such as 8xyz0, 8xyz1, ...
    # In order to communicate, the nodes need to be supplied with port numbers (this is something that I do not
    # really understand).
    #clust_with_ports = [s + ":" + port for s in
    #    clust]  # = ["Node009:8888", "Node010:8888", "Node011:8888", "Node012:8888", "Node047:8888"]
    # The node used for the specific task (where most likely we want one node per task) is used below to check
    # whether this is the first worker. The first worker is usually appointed some extra duties in addition to
    # training, for example printing; a bare print() would otherwise run in every task and give duplicate output.
    #num_workers = len(clust_with_ports)
    num_tasks = int(os.environ["SLURM_NTASKS"]) // len(clust)  # Here we assume the same number of GPUs on all nodes.
    # This should really be the only thing that needs to change in order to handle a different number of GPUs on
    # different nodes. I can't find a smart way of finding the number of GPUs allocated per node when this number
    # differs between nodes.
    # SLURM_TASKS_PER_NODE looks like "2(x3),1"; translate this into a list such as [2, 2, 2, 1], then loop over it.
    string = os.environ["SLURM_TASKS_PER_NODE"]  # e.g. "2(x3),1"
    num_gpus_per_node = []
    for entry in string.split(","):
        if "(x" in entry:
            # "2(x3)" means 2 tasks on each of 3 consecutive nodes.
            count, reps = entry.split("(x")
            num_gpus_per_node += [int(count)] * int(reps.rstrip(")"))
        else:
            num_gpus_per_node.append(int(entry))
    # e.g. "2(x3),1" -> [2, 2, 2, 1]
    clust_with_ports = []
    for i in range(len(clust)):
        for j in range(num_gpus_per_node[i]):
            clust_with_ports.append(clust[i] + ":" + port + str(j))
    num_workers = len(clust_with_ports)
    if int(os.environ["SLURM_PROCID"]) == 0:
        print(clust_with_ports)
    t = os.environ["SLURMD_NODENAME"]
    # Find the index of the current process within the cluster. If it belongs to the first node in the job, it is
    # appointed chief status. The index is also used as an output of this function.
    ind = clust.index(t) * num_tasks + int(os.environ["SLURM_LOCALID"])
    #print("ind: ", ind)
    role = "worker"  # All processes train; process 0 just gets some extra duties such as printing.
    if ind == 0 and int(os.environ["SLURM_PROCID"]) == 0:
        chief = t
    else:
        chief = 0
"""
If we explicitly appoint a worker as the chief, it seems to not take part in training. This can be done in this manner:
cfg = {'cluster': {'chief': [clust_with_ports[0]], 'worker' : clust_with_ports[1:] },
#'cluster': {'worker': clust},
'task': {'type': role,'index': ind,},
'rpc_layer': 'grpc',}
Here I say that the first node is the chief, and the rest are workers, i.e., first node does no computational work. This is most likely nto what I want.
I want all to be working, but I want worker with index 0 to in addition be responsible for printing etc.
"""
    cfg = {
        'cluster': {'worker': clust_with_ports},
        'task': {'type': role, 'index': ind},
        'rpc_layer': 'grpc',
    }
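    # For illustration (hypothetical job: the two nodes node009 and node010 with 2 GPUs each, a job ID ending in
    # 123, and the first process), the resulting TF_CONFIG would look roughly like:
    #   {"cluster": {"worker": ["node009:81230", "node009:81231", "node010:81230", "node010:81231"]},
    #    "task": {"type": "worker", "index": 0},
    #    "rpc_layer": "grpc"}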
    # These addresses with the "grpc://" prefixes are needed when doing profiling (I think); profiling multiple
    # workers seems hard.
    #addresses = [",".join(["grpc://" + c + ":" + port for c in clust])]
    #addresses = [",".join(["grpc://" + str(ind) + c for c in clust_with_ports])]
    addresses = [",".join(["grpc://" + c for c in clust_with_ports])]
    #print(addresses)
    # Now we have the full TF_CONFIG variable; write it to os.environ to set it as an environment variable.
    os.environ['TF_CONFIG'] = json.dumps(cfg)
    return addresses, chief, num_workers
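

# A minimal usage sketch (not part of the original module): TF_CONFIG must be set before the distribution
# strategy is created, so set_tf_config() is called first. This assumes the script runs inside a SLURM
# allocation on Berzelius with one task per GPU; the model and dataset below are hypothetical placeholders.
if __name__ == "__main__":
    import tensorflow as tf

    addresses, chief, num_workers = set_tf_config()
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    if int(os.environ["SLURM_PROCID"]) == 0:
        print("workers:", num_workers, "chief:", chief, "addresses:", addresses)
    # with strategy.scope():
    #     model = build_model()  # hypothetical: build and compile the model under the strategy scope
    # model.fit(dataset)         # hypothetical: the strategy shards the input across workers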