ML Systems · High Dimensional Data · Scalable Inference
Applied Research Scientist adapting computational intelligence (AI/ML) to real-world messiness: from efficient neural retrieval under network constraints, to training and inference for document layout intelligence, to sensor data modelling with advanced modelling methods. Eight years across national labs, academic research programs, and applied systems.
Connect
Selected Work
Three areas of sustained inquiry, each asking a variation of the same question: how do you make ML systems that actually work on messy, high-dimensional, real-world data, at inference time, under real constraints?
01 — Thesis (04.2026)
"Is embedding algorithm architecture inherently resilient to traffic constraints — or does resilience have to be designed in, at a cost?"
Stress-tests HNSW and IVF ANN architectures across a controlled degradation ladder (baseline → bandwidth-constrained → high-latency → packet loss), mirroring cloud throttling, cross-region lag, and edge node failures. Core contribution: a diagnostic sensitivity framework in which HNSW is latency-sensitive (sequential traversal) and IVF is bandwidth-sensitive (fewer, larger transfers), giving engineers a principled vocabulary for architecture-to-environment matching. Energy per query is tracked as a first-class metric alongside recall and latency. The research asks which design decisions make retrieval systems inherently robust or fragile, and at what cost. In short: most research asks how to make retrieval faster. My thesis asks which retrieval architectures are inherently fragile to network conditions, and what that fragility costs in energy.
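The latency-versus-bandwidth distinction can be illustrated with a toy cost model. This is a minimal sketch, not the thesis benchmark: it assumes each query costs a number of sequential round trips plus transfer time, and every number below (round-trip counts, payload sizes, network profiles) is a hypothetical placeholder chosen only to show the shape of the effect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPattern:
    """Hypothetical remote access pattern of one query."""
    round_trips: int      # sequential network round trips per query
    bytes_per_trip: int   # payload fetched on each round trip


@dataclass(frozen=True)
class Network:
    rtt_s: float              # round-trip time in seconds
    bandwidth_bytes_s: float  # usable bandwidth in bytes per second


def query_time(pattern: AccessPattern, net: Network) -> float:
    """Toy model: each round trip pays one RTT plus its transfer time."""
    per_trip = net.rtt_s + pattern.bytes_per_trip / net.bandwidth_bytes_s
    return pattern.round_trips * per_trip


# Assumed shapes only: many small sequential hops vs. a few large list reads.
HNSW_LIKE = AccessPattern(round_trips=200, bytes_per_trip=4_000)
IVF_LIKE = AccessPattern(round_trips=4, bytes_per_trip=2_000_000)

BASELINE = Network(rtt_s=0.001, bandwidth_bytes_s=1e9)
HIGH_LATENCY = Network(rtt_s=0.1, bandwidth_bytes_s=1e9)
LOW_BANDWIDTH = Network(rtt_s=0.001, bandwidth_bytes_s=1e7)


def slowdown(pattern: AccessPattern, degraded: Network) -> float:
    """Ratio of degraded query time to baseline query time."""
    return query_time(pattern, degraded) / query_time(pattern, BASELINE)


for name, pattern in (("HNSW-like", HNSW_LIKE), ("IVF-like", IVF_LIKE)):
    print(
        f"{name}: x{slowdown(pattern, HIGH_LATENCY):.1f} under high latency, "
        f"x{slowdown(pattern, LOW_BANDWIDTH):.1f} under low bandwidth"
    )
```

Under these assumed parameters, the many-round-trip pattern degrades more when RTT grows, while the few-large-transfers pattern degrades more when bandwidth shrinks. The model ignores packet loss, caching, and compute time.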
Thesis forthcoming →
02 — Publication (In Preparation)
"What if metric instability isn't a statistics problem — it's a numerical conditioning problem?"
Co-authored paper reframing binary classifier evaluation as a well-posedness problem. We extend the binormal ROC model into a unified differentiable manifold linking ROC, PR, and F₁ simultaneously — enabling threshold optimization via Brent, Golden-section, and RK4 algorithms on a smooth, analytically grounded surface. Four of five optimizers converged to identical optima within 10⁻⁶ tolerance. Bootstrap experiments show >40% reduction in threshold variance under smoothing. Newton's method diverged, exposing the non-convex structure of empirical F₁ and the necessity of bounded search.
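As a rough illustration of bounded threshold search on a smooth surface, the sketch below maximizes F₁ under a textbook binormal model using golden-section search. It is not the paper's code: the class separation `mu`, spread `sigma`, prevalence, and search interval are hypothetical values, and the binormal parameterization (negatives ~ N(0, 1), positives ~ N(mu, sigma²)) is an assumption made for the example.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def binormal_f1(t: float, mu: float = 2.0, sigma: float = 1.0,
                prevalence: float = 0.3) -> float:
    """F1 at threshold t when negatives ~ N(0, 1) and positives ~ N(mu, sigma^2).

    mu, sigma, and prevalence are illustrative assumptions.
    """
    tpr = 1.0 - std_normal_cdf((t - mu) / sigma)
    fpr = 1.0 - std_normal_cdf(t)
    # F1 = 2TP / (2TP + FP + FN), expressed with rates and prevalence.
    return (2.0 * prevalence * tpr
            / (prevalence * tpr + (1.0 - prevalence) * fpr + prevalence))


def golden_section_max(f, lo: float, hi: float, tol: float = 1e-6) -> float:
    """Maximize a unimodal function f on the bounded interval [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc > fd:
            # The maximum lies in [lo, d]; reuse c as the new right probe.
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = f(c)
        else:
            # The maximum lies in [c, hi]; reuse d as the new left probe.
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = f(d)
    return 0.5 * (lo + hi)


best_t = golden_section_max(binormal_f1, -2.0, 5.0)
print(f"threshold = {best_t:.4f}, F1 = {binormal_f1(best_t):.4f}")
```

Because the search never leaves the interval, it cannot run off to infinity the way an unbounded Newton step can, at the price of assuming a single peak inside the bounds.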
Preprint forthcoming →
03 — Applied ML Engineering
"How do you turn a decade of PDF reports into a queryable knowledge base?"
Built a production document intelligence system that extracts structured table data from heterogeneous PDF corpora — the kind of documents where layout, encoding, and schema vary unpredictably. The key research contribution is the disambiguation layer: handling merged cells, multi-header tables, and rotated layouts without a fixed template. Now used on real document pipelines.
View on GitHub →
04 — Program Leadership
"How do you build AI/ML-ready datasets for domains where none exist?"
Expanded the open-source science curriculum and research program for YouthMappers through training, projects, and the creation of open geospatial datasets from diverse sources for policy and ML research. Developed pedagogical frameworks for ML + GIS methods and trained students at scale. The outputs (datasets, methods, and trained practitioners) are in active use in the research community. The work directly addressed training data limitations and supported new ML/DL model development.
Program overview →
Ongoing Series
Parallel & Distributed Systems Notes
A chapter-by-chapter study of parallel computing applied to ML systems & large-scale data structures. 3 entries published, ongoing.
Dynamic Resource
AI/ML Systems Research Links
Automatically curated papers and news on embedding systems, geospatial ML, and parallel computing. Updated weekly.
Scholarly Work
Peer-reviewed contributions across energy systems informatics, geological data infrastructure, and applied learning frameworks.
WELLBASE: A Standardized Data Infrastructure for Well Log Analytics
Geological, Oil & Gas data systems · Peer-reviewed
ROKBASE: Rock Sample Database for Imaging DL Applications
Imaging systems · Peer-reviewed
Lite Learning: A Lightweight Framework for Model Training in Resource-Constrained Environments
Model Training research · Peer-reviewed
Numerical Smoothing of Noisy Evaluation Surfaces: A Classical Approach to Robust ML Threshold Optimization
Nakacwa S., Luis P. · Harrisburg University · In preparation
Thesis: Architecture Resilience Under Network Degradation: A Controlled Benchmarking Study of Embedding Retrieval Systems
Harrisburg University · In preparation
Background
Eight years across mission-critical research, academic ML, and applied systems engineering, always working on the same class of problem from different vantage points: how do intelligent systems behave when the environment they were designed for stops cooperating?
AI/ML R&D Science and Engineering
Applies embedding knowledge and computer vision to categorize mineral regimes and establish a computational basis for energy resource evaluation. Designed mathematical models and data pipelines to reconstruct fragmented oil and gas records from disparate sources into a unified, queryable national science asset. Develops inference systems that make inaccessible document archives (PDFs, scanned reports, legacy formats) machine-readable, recovering decades of domain knowledge for advanced research.
Graduate Research
Computing Systems and Algorithms: benchmarking the robustness of retrieval algorithm architectures. Strengthened AI/ML and computing systems knowledge.
Graduate Research Assistant
Research project on ML Training Data & Large Data System Design. Addressed data losses caused by database schema variation in open-source data.
Program Director, Regional Program Training
Expanded open science learning curricula for GIS and ML applications. Increased open geospatial dataset creation for communities. Program outputs support policy and AI/ML research.
Applied ML Research - Software Development
Produced a document intelligence utility for structured extraction from heterogeneous PDF corpora. Focused on software architecture and design for applying transformer models.
Geospatial Subject Matter Expert
Mapped schema variations and advised on adaptations for land data mapping and national record digitalisation. Also covered data management, data processing, and software development.
Exploratory Work
Algorithmic tests, benchmark studies, and data models: work that extends existing knowledge to real-world events.
Behavioral Signal Extraction from Mobility Data
Treats COVID-19 lockdown periods as a natural experiment — asking what population movement signals reveal about how policy propagates through behavior, and whether real-time observation changes the answer.
GitHub →
Infrastructure Adoption as a Spatial Problem
Models U.S. EV charging growth as a spreading pattern across geography, asking where adoption moves next and what the shape of that curve reveals about energy transition timelines.
GitHub →
Air Quality Geospatial Pipeline
Scrapy-based pipeline collecting EPA AirNow data for geospatial air quality analysis.
GitHub →
Time Series Forecasting Compendium
Comparative study of classical statistical and deep learning forecasting models (ARIMA, Prophet) — where each architectural class breaks down and whether failure modes are predictable from the structure of the algorithm.
GitHub →
Document Intelligence - OCR
Early tests in recovering structured data from documents never intended for machines. A precursor to the AIPDF2Table tool, a research table extraction and processing pipeline for PDF and multi-format document corpora.
GitHub →
GPU Modernization of Floyd-Warshall
Extends and benchmarks the Floyd-Warshall all-pairs shortest path algorithm on CPUs versus modern GPUs, examining where the architectural assumptions change. Part of the PCAM methodology study; results and benchmarks are in the repo.
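For readers unfamiliar with the algorithm, here is a minimal CPU reference sketch of Floyd-Warshall in plain Python. It is not the repository's implementation; the example graph is hypothetical. The comments mark the loop structure that makes the algorithm interesting on parallel hardware: the outer `k` loop must run in order, while the `(i, j)` updates for a fixed `k` are independent of one another.

```python
import math

INF = math.inf


def floyd_warshall(weights):
    """All-pairs shortest paths on a dense adjacency matrix.

    weights[i][j] is the edge weight from i to j, INF if there is no edge,
    and 0 on the diagonal. Assumes no negative cycles.
    """
    n = len(weights)
    dist = [row[:] for row in weights]  # copy so the input is not modified
    for k in range(n):
        # Sequential dependency: step k relies on the results of step k - 1.
        row_k = dist[k]
        for i in range(n):
            # For a fixed k, every (i, j) update is independent of the others,
            # which is the part a GPU kernel could process in parallel.
            d_ik = dist[i][k]
            row_i = dist[i]
            for j in range(n):
                candidate = d_ik + row_k[j]
                if candidate < row_i[j]:
                    row_i[j] = candidate
    return dist


# Hypothetical 4-node directed graph used only for illustration.
example = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]
shortest = floyd_warshall(example)
for row in shortest:
    print(row)
```

The work per `k` step is O(n²) with regular memory access, so the practical question on a GPU is usually how to tile the matrix and limit synchronization between `k` steps rather than how to find parallelism at all.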
GitHub →
Get In Touch
I'm interested in research scientist and senior AI/ML engineering roles at organizations advancing knowledge or products in geometric data, earth systems, infrastructure-scale ML, or efficient retrieval.