Week 10 & Week 11 - August 7, 2023

lakshmi97
Published: 08/23/2023

Carbonate issues, GPU availability, Tensorflow errors: Week 10 & Week 11

=========================================================

 

What I did this week

~~~~~~~~~~~~~~~~~~~~

Recently, I've been an assigned RP(Research Project) account on University of Bloomington's HPC cluster - Carbonate. This account lets me access multiple GPUs for my experiments in a dedicated account.

Once I started configuring my sbatch file accordingly, I started running into issues like GPU access. My debug print statements revealed that I'm accessing 1 CPU despite configuring the sbatch job for more than 1 GPUs. I double checked my dataloader definition, DistributionStrategy, train function. I read through IU's blogs as well as other resources online to see if I'm missing something.

Nothing worked, my mentor eventually asked me to raise a IT request on Carbonate, the IT personnel couldn't help either. This could only mean that Tensorflow is picking upon assigned GPUs. So, on my mentor's suggestion, I loaded an older version of the deeplearning module 2.9.1(used 2.11.1 earlier). This worked!

This also meant using a downgraded version of tensorflow(2.9). This meant I ran into errors again, time taking yet resolvable. I made some architectural changes - replaced GroupNorm with BatchNorm layers, tensor_slices based DataLoader to DataGenerator - to accommodate for the older tensorflow version. Additionally, I also had to change the model structure from a list of layers to ``tensorflow.keras.Sequential`` set of layers with input_shape information defined in the first layer. Without this last change, I ran into ``None`` object errors.

Once all my new code was in place, the week ended, hahahah. And also GPU's were in scarcity in the same week. I'm glad I got some work done though.

 

What Is coming up next week

~~~~~~~~~~~~~~~~~~~~~~~~~~~

Run more experiments!

 

Did I get stuck anywhere

~~~~~~~~~~~~~~~~~~~~~~~~

All I did was get stuck again & again :P

But all is well now.

 

DJDT

Versions

Time

Settings from gsoc.settings

Headers

Request

SQL queries from 1 connection

Static files (2312 found, 3 used)

Templates (11 rendered)

Cache calls from 1 backend

Signals

Log messages