Huawei Canada has an immediate, permanent opening for a Distinguished Engineer - AI Computing System.
About the team:
The Advanced Computing and Storage Lab, part of the Vancouver Research Centre, explores adaptive computing system architectures that address the challenges posed by flexible and variable application workloads. The lab helps ensure the stability and quality of training clusters, builds dynamic cluster-configuration strategy solvers, and establishes precision-control systems to deliver stable, efficient compute clusters. One of the lab's goals is to focus on key industry AI application scenarios, such as large-model training and inference, applying key technologies such as low-precision training, multi-modal training, and reinforcement learning. The lab is responsible for bottleneck analysis and for the design and development of optimization solutions that improve training and inference performance and usability.
About the job:
- As an industry-leading expert in training-cluster software frameworks and technologies, track the evolution of industry AI large-model training frameworks and their key features. Plan and roadmap AI frameworks and software features for scenarios such as large-model pre-training, post-training, and integrated training and inference, building key capabilities for the company's training-cluster software framework.
- In the company's large-model training optimization field, lead the team to build key technologies such as low-precision training, parallel-strategy tuning, and training-resource optimization, and promote the commercial deployment of large-model-aware optimization technologies.
- For the company's training servers, super nodes, and related products, lead the team to build large-model AI training frameworks, operator libraries, acceleration libraries, and other software frameworks and acceleration features, fully leveraging system engineering and software-hardware co-design to improve AI cluster computing efficiency.
- Identify high-quality academic resources in large-model training, collaborate with domain experts and scholars on projects, develop related standards and patents, support the company's continued innovation in the training-cluster field, and build long-term competitiveness in AI training clusters.
- Cultivate a team of technical experts and key technical staff in AI training-cluster frameworks and software optimization.
The base salary for this position ranges from $172,000 to $230,000, depending on education, experience, and demonstrated expertise.
About the ideal candidate:
- A degree in artificial intelligence, computer science, software engineering, automation, physics, mathematics, electronics, microelectronics, information technology, or a related field, with more than 5 years of R&D experience in large-model training and optimization.
- Proficiency with common large-model architectures such as DeepSeek and Llama, with deep technical expertise in large-model training and inference optimization in areas such as LLMs, MoE, and multimodal learning.
- Familiarity with the hardware architecture and programming systems of AI accelerators such as GPUs and NPUs, with experience optimizing AI systems through software-hardware-core co-design.
- Familiarity with cluster computing and cloud computing, with experience in software architecture design for cluster scheduling.
- Enjoys research, with strong learning ability, good communication skills, and the ability to work well in a team.