Compensation
Salary undisclosed
Description
Your impact
You will collaborate to build a high-fidelity biological data layer that serves as the foundation for machine learning at Isomorphic Labs. Moving beyond raw data ingestion to create curated biological datasets, you will ensure model training is consistently grounded in high-quality, standardized, and version-controlled biological data. Harmonizing disparate public datasets and internal data into a coherent representation, you will unlock the information needed to fuel our mission to solve all disease.
Working in an interdisciplinary environment, you will partner with ML Research, Computational Biology, Drug Development, and Chemistry teams to drive the adoption of standardized bioinformatics primitives and best practices into their daily workflows. Your work will provide projects with a significant head start by solving complex bioinformatics problems at the platform level. Ultimately, your contribution ensures that our models are built on a robust and integrated data resource, producing predictions on a coherent view of biology, directly accelerating our Drug Discovery programs.
What you will do
- Develop and operate large-scale bioinformatics pipelines for high-throughput data analysis, ensuring reliable processing from raw data (e.g., FASTQ, BAM, mzXML) to ML-ready datasets.
- Apply bioinformatics best practices to the ingestion and harmonization of complex datasets, ensuring model training is grounded in high-quality, version-controlled, and coherently integrated biological datasets.
- Harmonize disparate public databases (e.g., Ensembl, UniProt, Reactome, Open Targets), implementing rigorous versioning and mapping strategies to mitigate identifier collisions, data loss, and semantic drift across releases.
- Act as a strategic partner to our ML Research, Computational Biology, Drug Development, and Chemistry teams, championing the adoption of the internal bioinformatics platform and standardized biological data primitives into their daily research workflows.
- Participate in research projects as a "Deployed Engineer", providing customized solutions and identifying technical gaps while ensuring project-specific insights are contributed back into the core bioinformatics platform.
- Provide documentation, guidance, and training on data resources and curation processes to the wider organization.
Skills and qualifications
Essential:
- Proven experience with the large-scale processing of raw bioinformatics data (e.g., FASTQ, BAM, mzXML).
- A demonstrable track record of delivering high-quality bioinformatics outputs across varied modalities (e.g., genomics, proteomics, functional genomics, systems biology, single cell).
- Experience delivering bioinformatics solutions directly to research teams, scientific communities, or industry projects, with a strong focus on user enablement.
- Experience writing production-grade code in Python and developing automated, scalable bioinformatics pipelines.
- PhD or MSc in Bioinformatics, Computational Biology, or a related field, or equivalent practical experience in a biopharmaceutical or research environment.
Nice to have:
- Experience with domain-specific workflow systems (e.g., Nextflow) for scaling high-throughput pipeline execution.
- Familiarity with general-purpose data orchestration and processing frameworks (e.g., Dagster, Apache Beam) for integrating research pipelines into a production platform.
- Familiarity with building and maintaining bioinformatics infrastructure on Google Cloud Platform (GCP).
- Familiarity with modern, high-performance DataFrame libraries (e.g., Polars), and relational data modeling and analysis (SQL).
- Exposure to machine learning concepts and the specific data requirements for training ML models.
- Demonstrable experience working with regulated PHI data.
- Extensive experience in software development with Python.
- Posted: Jul 2, 2026
- Last seen: Jul 2, 2026
- First seen: Jul 2, 2026


