eBPF Compiler Optimization
Optimized eBPF bytecode at the LLVM level for JIT vectorization, cutting memset instruction count by ~50% for structs larger than 300 bytes. Benchmarked Tracee, Tetragon, and Sysdig with lmbench and perf to measure CPU-cycle overhead.
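A back-of-envelope sketch of where a ~50% reduction like that comes from: widening the stores emitted for a struct zero-fill roughly halves the instruction count. The widths and sizes below are illustrative arithmetic only, not the actual codegen.

```python
# Illustrative arithmetic only: zeroing a struct with scalar 8-byte stores
# versus wider 16-byte vector stores. Real codegen depends on alignment,
# the target ISA, and the eBPF verifier's constraints.
def store_count(struct_bytes: int, store_width: int) -> int:
    return -(-struct_bytes // store_width)  # ceiling division

scalar = store_count(320, 8)    # 40 stores for a 320-byte struct
vector = store_count(320, 16)   # 20 stores: ~50% fewer instructions
```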
Systems & AI Engineer
I work on systems-level code and AI tooling. Right now that means eBPF programs at the kernel level and LangGraph agents that can actually get things done.
I like knowing how things work under the hood. Most of my time goes into writing eBPF programs that hook into the kernel, CUDA code for GPU workloads, and building AI agents that do more than just chat.
I did my Master's at CU Boulder (3.95 GPA), where I spent time on WebRTC infrastructure and kernel-level packet processing. Before that I was doing ML research at IIT Patna and building full-stack tools at SimPPL and the University of Mumbai.
Right now at HeyNoah, I'm writing LangGraph agents with tool-use, memory, and human-in-the-loop checkpoints for enterprise workflows.
M.S. Computer Science, CU Boulder
GPA: 3.95 / 4.00
Systems Programming, AI Agents, Distributed Computing
HeyNoah - Palo Alto, CA
Independent Research, CU Boulder - Boulder, CO
University of Colorado Boulder - Boulder, CO
SimPPL - Mumbai, India
University of Mumbai - Mumbai, India
IIT Patna - Patna, India
University of Colorado Boulder - Boulder, CO
Design & Analysis of Algorithms, Compiler Construction, Distributed Systems, NLP, Linux System Administration, Data Center Scale Computing, Modern Computing Systems
University of Mumbai - Mumbai, India
Data Structures, Operating Systems, Database Management Systems, Computer Networks, Machine Learning, Distributed Computing, Information Security, Software Engineering, Discrete Mathematics
Custom CUDA kernels for batched vector add/mul/scale, 10x faster than single-threaded CPU on 1M+ element arrays. Used coalesced memory access, pinned host memory, and async streams to maximize throughput.
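A plain-Python model of the indexing scheme behind those kernels, assuming a standard grid-stride loop; the block and grid sizes are illustrative, and the real implementation is CUDA C, not Python:

```python
# Pure-Python model of a CUDA grid-stride loop for batched vector add.
# Hypothetical block/grid sizes; the real kernel runs the loop body once
# per GPU thread with idx = blockIdx.x * blockDim.x + threadIdx.x.
def vec_add(a, b, out, block_dim=256, grid_dim=4):
    n = len(a)
    stride = block_dim * grid_dim          # total threads in the grid
    for block in range(grid_dim):          # parallel across SMs on a GPU
        for thread in range(block_dim):    # parallel within a block
            idx = block * block_dim + thread
            while idx < n:                 # grid-stride loop over the array
                out[idx] = a[idx] + b[idx]
                idx += stride
```

Consecutive threads touch consecutive indices here, which is what makes the real kernel's global-memory accesses coalesced.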
Compiler that takes Python expressions, parses them into a custom AST, and spits out SQL or MongoDB queries. Maps Pandas DataFrames to relational tables with schema inference. 10x faster than routing the same queries through an ORM.
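The core move can be sketched with the standard-library ast module; the function name and the operator coverage here are illustrative, not the project's actual API:

```python
import ast

# Minimal sketch: walk a Python boolean/comparison expression's AST and
# emit a SQL WHERE-clause string. Operator coverage is illustrative only.
_SQL_OPS = {ast.Gt: ">", ast.Lt: "<", ast.Eq: "=",
            ast.GtE: ">=", ast.LtE: "<=", ast.NotEq: "<>"}

def to_sql_predicate(expr: str) -> str:
    tree = ast.parse(expr, mode="eval")
    return _emit(tree.body)

def _emit(node):
    if isinstance(node, ast.BoolOp):               # and / or
        joiner = " AND " if isinstance(node.op, ast.And) else " OR "
        return "(" + joiner.join(_emit(v) for v in node.values) + ")"
    if isinstance(node, ast.Compare):              # chained comparisons
        left, parts = _emit(node.left), []
        for op, comp in zip(node.ops, node.comparators):
            parts.append(f"{left} {_SQL_OPS[type(op)]} {_emit(comp)}")
            left = _emit(comp)
        return " AND ".join(parts)
    if isinstance(node, ast.Name):                 # column reference
        return node.id
    if isinstance(node, ast.Constant):             # literal value
        return repr(node.value) if isinstance(node.value, str) else str(node.value)
    raise ValueError(f"unsupported node: {type(node).__name__}")
```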
MPI + OpenMP sorting benchmark in C++. Partitions data across nodes with MPI, then quicksorts locally with OpenMP threads. Tested scaling from 1 to 16 ranks to see where communication overhead starts to dominate.
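In plain Python, the partitioning scheme looks roughly like this; the bucket exchange stands in for the MPI all-to-all and the per-bucket sort for the OpenMP-threaded quicksort, so this is a sketch of the data flow, not the C++ implementation:

```python
# Sketch of a sample-sort-style partition: pick splitters, route elements
# to destination-rank buckets (models the MPI exchange), then sort each
# bucket locally (models the per-node threaded quicksort).
def distributed_sort(data, n_ranks=4):
    # choose n_ranks-1 splitters from a sorted sample of the input
    sample = sorted(data[:: max(1, len(data) // (n_ranks * 8))])
    step = max(1, len(sample) // n_ranks)
    splitters = sample[step::step][: n_ranks - 1]
    # route each element to its rank's bucket
    buckets = [[] for _ in range(n_ranks)]
    for x in data:
        r = sum(x > s for s in splitters)  # index of the rank whose range holds x
        buckets[r].append(x)
    # each "rank" sorts locally; concatenating in rank order is globally sorted
    for b in buckets:
        b.sort()
    return [x for b in buckets for x in b]
```

The scaling question the benchmark probes is visible here: the routing step is all communication, so as ranks grow its cost eventually dominates the local sorts.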
GNN-based phishing detector for Ethereum in PyTorch. Converts transaction histories into temporal ego-graphs, trains a classifier on them, and hits 98% F-score. The ego-graph trick also sped up training by 20%.
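The ego-graph construction can be sketched as a k-hop traversal over time-filtered edges; the transaction tuple layout and function name are assumed simplifications, and the GNN that consumes the output is omitted:

```python
from collections import defaultdict

# Sketch: pull the k-hop neighborhood around a target address, keeping only
# transactions at or before a cutoff timestamp (the "temporal" part).
# (sender, receiver, timestamp) tuples are a placeholder schema.
def temporal_ego_graph(transactions, center, hops=2, cutoff=float("inf")):
    adj = defaultdict(set)
    for src, dst, ts in transactions:
        if ts <= cutoff:
            adj[src].add(dst)
            adj[dst].add(src)          # treat transfers as undirected for reach
    nodes, frontier = {center}, {center}
    for _ in range(hops):              # expand one hop at a time
        frontier = {n for u in frontier for n in adj[u]} - nodes
        nodes |= frontier
    # induced edge list of the ego-graph, ordered by time
    return sorted(((s, d, t) for s, d, t in transactions
                   if t <= cutoff and s in nodes and d in nodes),
                  key=lambda e: e[2])
```

Training on these small induced subgraphs instead of the full transaction graph is what delivers the speedup: each sample's neighborhood is bounded by the hop count.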
Trained 3 different transformer setups on combined text, audio, and video data to figure out what causes emotional responses. Ran ablation studies across modalities and got to an 85% F-score. Split model components across cloud instances with async message-passing for 20% better throughput.
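The async message-passing layout can be sketched with asyncio queues: stages (e.g. per-modality encoders feeding a fusion head) run as coroutines joined by queues, so one stage can work on batch i+1 while the next handles batch i. The stage functions below are stand-ins for the real model components.

```python
import asyncio

# Each pipeline stage pulls from an inbox queue, applies its function,
# and pushes to an outbox queue. A None sentinel shuts the pipeline down.
async def stage(fn, inbox, outbox):
    while True:
        item = await inbox.get()
        if item is None:               # sentinel: propagate shutdown downstream
            await outbox.put(None)
            return
        await outbox.put(fn(item))

async def run_pipeline(batches, fns):
    queues = [asyncio.Queue() for _ in range(len(fns) + 1)]
    tasks = [asyncio.create_task(stage(fn, queues[i], queues[i + 1]))
             for i, fn in enumerate(fns)]
    for b in batches:                  # feed all batches, then the sentinel
        await queues[0].put(b)
    await queues[0].put(None)
    results = []
    while (out := await queues[-1].get()) is not None:
        results.append(out)
    await asyncio.gather(*tasks)
    return results
```

The throughput gain of such a layout comes from overlap: with k stages, up to k batches are in flight at once instead of one.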