Huawei Canada has an immediate 12-month contract opening for a Machine Learning
Software Engineer.
About the team:
The Software-Hardware System Optimization Lab continuously improves the power
efficiency and performance of smartphone products through software-hardware
system optimization and architecture innovation. We track trends in cutting-edge
technologies and build competitive strength in mobile AI, graphics, multimedia,
and software architecture for mobile phone products.
About the job:
- Profile and optimize end-to-end ML workloads and kernels to improve latency, throughput, and efficiency across GPU/NPU/CPU.
- Identify bottlenecks (compute, memory, bandwidth) and land fixes: tiling, fusion, vectorization, quantization, mixed precision, layout changes.
- Build and extend tooling for benchmarking, tracing, and automated regression/performance testing.
- Collaborate with compiler/runtime teams to land graph- and kernel-level improvements.
- Apply ML/RL-based techniques (e.g., cost models, schedulers, autotuners) to search for better execution plans.
- Translate promising research prototypes into reliable, scalable production features and services.
The target annual compensation (based on 2,080 hours per year) ranges from
$78,000 to $168,000 depending on education, experience, and demonstrated
expertise.
About the ideal candidate:
- Master's or PhD degree in Computer Science or a related field. Solid experience in ML systems or performance engineering (industry, OSS, or research). Fluency in Python and C++.
- Hands-on experience with at least one compute stack: CUDA/ROCm, OpenCL, Metal/Vulkan compute, Triton, or vendor/open-source NPUs.
- Practical knowledge of PyTorch or TensorFlow/JAX and inference/training performance basics (mixed precision, graph optimizations, quantization).
- Ability to turn ambiguous performance problems into measurable, repeatable experiments.
- AI compiler exposure: TVM, IREE, XLA/MLIR, TensorRT, or similar. Profiling skills (Nsight, perf, VTune, CUPTI/ROCm tools) and comfort reading roofline/memory-hierarchy signals.
- Experience with kernel scheduling/auto-tuning (RL, Bayesian/EA search) and hardware counters.
- Background with custom accelerators/NPUs, DMA/tiling/SRAM management, or quantization (INT8/FP8).
- Contributions to relevant OSS (links welcome).