I have a question regarding clustering. From the lecture I understand that a cluster
is basically a bunch of computers communicating with each other, and if one of them
goes down another machine will kick in, as explained in the lecture.
Does this mean that the other computer will supply the resources of the computer
that died? What I mean is: if a computer has 8 GB of RAM, the job it is doing
uses 4 or 5 GB, and it dies, will the other computer create the jobs and set aside
extra RAM to handle the work the dead computer was doing?
What exactly does “kick in” mean in this scenario?
Well, it depends on the cluster that is set up as a backup. If it has that amount of RAM, then yes, it will provide those resources. If not, the job will stop or pause, depending on the options it was started with. "Kick in" basically means that control of the running program passes to the backup cluster, which then handles the processing. It is not as easy as it sounds, because there are many technicalities: will the program pause, or will it just crash and then restart when it moves to another cluster, and will there be any race conditions?
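To make the decision above concrete, here is a rough sketch in Python. It is only an illustration, not how any particular cluster manager works: the names `Node`, `Job`, `fail_over` and the `on_no_capacity` option are all made up for this example, and a real system also has to deal with saving state, restarting and race conditions.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_ram_gb: float  # RAM the backup can still hand out


@dataclass
class Job:
    name: str
    ram_gb: float  # RAM the job needs
    on_no_capacity: str = "pause"  # hypothetical start option: "pause" or "stop"


def fail_over(job: Job, backup: Node) -> str:
    """Decide what happens to a job after the node running it has died."""
    if backup.free_ram_gb >= job.ram_gb:
        # The backup has enough RAM, so it reserves it and takes over the job.
        backup.free_ram_gb -= job.ram_gb
        return f"running on {backup.name}"
    # Not enough RAM: fall back to the option the job was started with.
    return "paused" if job.on_no_capacity == "pause" else "stopped"


job = Job("analysis", ram_gb=5)
print(fail_over(job, Node("backup-a", free_ram_gb=8)))  # running on backup-a
print(fail_over(job, Node("backup-b", free_ram_gb=3)))  # paused
```

With the 8 GB machine from the question, a 5 GB job fits on the backup and keeps going. On a backup with only 3 GB free, the job pauses or stops depending on how it was started.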
OK, thank you. So I’ll take it that it depends on how the system admin configured the clusters.