High-Performance Computing Senior Engineer

DSO National Laboratories - Queenstown, Otago, New Zealand

22 days ago

JOB DESCRIPTION

DSO National Laboratories (DSO) is Singapore’s largest defence research and development (R&D) organisation, with the critical mission to develop technological solutions to sharpen the cutting edge of Singapore's national security. At DSO, you will develop more than just a career. This is where you will make a real impact and shape the future of defence across the spectrum of air, land, sea, space and cyberspace.

The Digital Division leads the digital transformation of DSO through master planning and policies, delivering digital capabilities through IT infrastructure, and providing a one-stop service to the corporate and R&D Divisions. The Digital Division will transform the way we work, our workplace, and the capabilities we deliver to MINDEF / SAF and for the security of Singapore.

People are DSO’s greatest asset. You will get to realise your career aspirations and develop your own niche either as a deep technical expert or a leader in the team. With frequent career dialogues and a robust training and development framework, we will provide you with the necessary development tools for you to reach your potential. You will also be recognised and rewarded through competitive remuneration packages and scholarship opportunities.

Responsibilities

Ensure the reliable operation of the central GPU clusters used for AI training and of the High-Performance Computing (HPC) clusters
Advise users on workload execution and optimization strategies
Provide users with support for the resources they need
Support the maintenance and troubleshooting of AI and HPC infrastructure to ensure system stability. Work with the OEM vendor for troubleshooting and part replacements
Manage day-to-day operations of the GPU cluster, HPC cluster, distributed storage system and other associated IT infrastructure (e.g. head nodes)

JOB REQUIREMENTS

Degree in Computer Science / Computer Engineering

Experience with HPC scheduling and workload management tools (Run.AI and SLURM preferred)

Experience in managing parallel file systems (e.g., Lustre), with a strong understanding of HPC storage principles

Experience with cluster management software (e.g., BCM)

Proficient in Python and Bash scripting for automation tasks

Experience with container technologies (e.g., Docker); container orchestration using Kubernetes is a plus

Understanding of basic network protocols (e.g., DHCP, DNS, SSH, SCP, SMTP)

Proficient in UNIX / Linux operating systems and command-line interfaces (e.g., Ubuntu, Red Hat)

Familiar with monitoring tools (e.g., Prometheus, Grafana, PRTG, Environet)

Good knowledge and experience in HPC performance optimization and troubleshooting

Proven working knowledge of HPC systems and software

Strong programming skills in Python and Bash scripting

Familiarity with HPC schedulers (e.g., SLURM), container orchestration (e.g., Kubernetes), and GPU-based systems

SKILLS

PARALLEL COMPUTING

DISTRIBUTED SYSTEMS

CLUSTER MANAGEMENT

EXPERIENCE

5 ~ 10 years
