AI Engineer, Multimodal (Vision + Speech)
Responsibilities
- Build multimodal AI features involving vision, speech, and text understanding.
- Fine-tune and evaluate vision and speech models for domain-specific tasks.
- Optimize inference pipelines for real-time interaction.
- Collaborate with product teams to design multimodal user experiences.
- Contribute to internal tooling and evaluation frameworks.
Requirements
- 3+ years of experience working with computer vision or speech models.
- Hands-on experience with PyTorch, TensorFlow, or JAX.
- Familiarity with multimodal model architectures (e.g., vision-language models).
- Experience with real-time audio or video processing pipelines is a plus.
- Strong math background (linear algebra, probability, optimization).