Huawei Canada has an immediate permanent opening for a Senior Engineer.
About the team:
The Centre for Software Excellence Lab conducts pioneering research in software
engineering, focusing on next-generation technologies. This team integrates
industry best practices with cutting-edge academic research to address software
engineering challenges across the full lifecycle, including foundation model applications,
software performance engineering, hyper-cluster programming, next-gen mobile OS,
and cloud-native computing. This lab uniquely allows researchers to apply
innovations directly to products affecting billions of customers while promoting
open-source contributions, publications, conference participation, and
collaborations to create a broader impact.
About the job:
- Design and implement scalable infrastructure for AI/LLM workloads, including
  model pre-training, post-training, reinforcement learning, multimodal data
  processing, and model serving.
- Contribute to open-source projects and stay current with the latest
  developments in AI infrastructure (e.g., Ray, vLLM, veRL).
- Develop and maintain data pipelines using tools like Ray Data to handle
  large-scale datasets efficiently.
- Optimize system performance and resource utilization across heterogeneous
  computing environments.
- Collaborate with cross-functional teams to integrate infrastructure solutions
  into existing ML pipelines.
- Meet top industry and academic leaders and experts around the world,
  collaborate with top researchers and students, consult with engineering teams
  across diverse domains, publish research papers in far-reaching and impactful
  areas, and submit patent applications for novel inventions.
About the ideal candidate:
- Bachelor's, Master's, or Ph.D. degree in Computer Science, Electrical &
  Computer Engineering, Machine Learning, or a related field.
- Experience with large language models (LLMs) and related infrastructure.
- Solid experience with Python and/or C++, and familiarity with software
  development practices (version management, build management, CI/CD,
  debugging, and profiling).
- Solid understanding of at least one of the following areas: machine learning
  and/or deep learning, large-model training and fine-tuning (e.g., NLP/CV).
- Experience with mainstream model training and inference frameworks and tools
  (e.g., PyTorch, Hugging Face Transformers and Accelerate, DeepSpeed,
  Megatron, veRL).
- Experience with frameworks and tools for distributed computing (e.g., Spark,
  Flink, Ray) or cloud-native application and framework development (e.g.,
  Docker, Kubernetes).
- Ability to evaluate published research, apply it to real-world problems, and
  mature it on prototype systems, combined with an inquisitive mindset and
  proven research and communication skills.