Senior Engineer - Ray

Huawei Technologies Canada Co., Ltd. • Markham • 1w ago

Huawei Canada has an immediate permanent opening for a Senior Engineer.

About the team:

The Centre for Software Excellence Lab conducts pioneering research in software engineering, focusing on next-generation technologies. This team integrates industry best practices with cutting-edge academic research to address lifecycle software engineering challenges, including foundation model applications, software performance engineering, hyper-cluster programming, next-gen mobile OS, and cloud-native computing. This lab uniquely allows researchers to apply innovations directly to products affecting billions of customers while promoting open-source contributions, publications, conference participation, and collaborations to create a broader impact.

About the job:

Design and implement scalable infrastructure for a range of AI/LLM workloads, including model pre-training, post-training, reinforcement learning, multimodal data processing, and model serving.
Contribute to open-source projects and stay up to date with the latest developments in AI infrastructure (e.g., Ray, vLLM, veRL).
Develop and maintain data pipelines using tools like Ray Data to handle large-scale datasets efficiently.
Optimize system performance and resource utilization across heterogeneous computing environments.
Collaborate with cross-functional teams to integrate infrastructure solutions into existing ML pipelines.
Meet top industry and academic leaders and experts around the world, collaborate with top researchers and students, consult with Engineering teams across diverse domains, publish research papers in far-reaching and impactful areas, and submit patent applications for novel inventions.

About the ideal candidate:

Bachelor's, Master's, or Ph.D. degree in Computer Science, Electrical & Computer Engineering, Machine Learning, or a related field.
Experience with large language models (LLMs) and related infrastructure.
Solid experience with Python and/or C++, and familiarity with software development practices (version management, build management, CI/CD, debugging, and profiling).
Solid understanding of at least one of these areas: machine learning and/or deep learning, large-model training and fine-tuning (e.g., NLP/CV).
Experience with mainstream model training and inference frameworks and tools (e.g., PyTorch, Hugging Face Transformers and Accelerate, DeepSpeed, Megatron, veRL).
Experience with frameworks and tools for distributed computing (e.g., Spark, Flink, Ray) or for cloud-native application and framework development (e.g., Docker, Kubernetes).
Ability to evaluate published research, apply it to real-world problems, and mature it on prototype systems, along with an inquisitive mindset and proven research and communication skills.