Huawei Canada has an immediate 12-month contract opening for a Machine Learning
Software Engineer.
About the team:
The Software-Hardware System Optimization Lab continuously improves the power
efficiency and performance of smartphone products through software-hardware
system optimization and architecture innovation. We track trends in cutting-edge
technologies and build competitive strength in mobile AI, graphics, multimedia,
and software architecture for mobile phone products.
About the job:
- Profile and optimize end-to-end ML workloads and kernels to improve latency, throughput, and efficiency across GPU/NPU/CPU.
- Identify bottlenecks (compute, memory, bandwidth) and land fixes: tiling, fusion, vectorization, quantization, mixed precision, layout changes.
- Build and extend tooling for benchmarking, tracing, and automated regression/performance testing.
- Collaborate with compiler/runtime teams to land graph- and kernel-level improvements.
- Apply ML/RL-based techniques (e.g., cost models, schedulers, autotuners) to search for better execution plans.
- Translate promising research prototypes into reliable, scalable production features and services.
The target annual compensation (based on 2,080 hours per year) ranges from
$78,000 to $168,000 depending on education, experience, and demonstrated
expertise.
About the ideal candidate:
- Master's or PhD degree in Computer Science or a related field. Solid experience in ML systems or performance engineering (industry, OSS, or research). Fluency in Python and C++.
- Hands-on experience with at least one compute stack: CUDA/ROCm, OpenCL, Metal/Vulkan compute, Triton, or vendor/open-source NPUs.
- Practical knowledge of PyTorch or TensorFlow/JAX and inference/training performance basics (mixed precision, graph optimizations, quantization).
- Ability to turn ambiguous performance problems into measurable, repeatable experiments.
- AI compiler exposure: TVM, IREE, XLA/MLIR, TensorRT, or similar. Profiling skills (Nsight, perf, VTune, CUPTI/ROCm tools) and comfort reading roofline/memory-hierarchy signals.
- Experience with kernel scheduling/auto-tuning (RL, Bayesian/EA search) and hardware counters.
- Background with custom accelerators/NPUs, DMA/tiling/SRAM management, or quantization (INT8/FP8).
- Contributions to relevant OSS (links welcome).