01 — About
Hi, I’m Rishab — an AI researcher and engineering student from BITS Pilani who accidentally fell in love with deep learning during my third year. That first internship was the spark: I saw what machines could actually do, and I’ve been hooked ever since.

These days, you’ll find me exploring the weird and wonderful world of AI architectures — from Transformers and Diffusion models to Mamba and beyond. I love digging into how models think: attention mechanisms, fine-tuning tricks like PEFT and distillation, and the magic behind text generation and image synthesis.

But I’m not just here for the theory; I genuinely enjoy building things that work. Whether it’s a production system like Moody.AI or a research project in medical imaging, I love the journey from idea to deployment.

Currently, I’m part of the Avatar team at FLAM, where I get to play with image generation — GANs, diffusion, normalizing flows — and even dabble in streaming tech like WebRTC (because why limit yourself?).

If you’re into AI — whether it’s research, engineering, or just staying up late debating whether attention really is all you need — I’d love to connect. Let’s geek out, collaborate, or simply share ideas. The best conversations start with curiosity.
02 — Experience
03 — Research
Lightweight Fourier Block Transformer achieving 88.41% accuracy for real-time osteoporosis detection directly on Android devices using knee X-ray sensor images.
IEEE · 2026
Novel multiband-frequency aware network achieving 92.22% accuracy on bone fracture detection benchmarks.
IET / Wiley · 2025
Published in "Non-stationary and nonlinear data processing for automated computer-aided medical diagnosis".
Elsevier · 2025
04 — Projects
Multimodal AI that analyzes both audio and visual streams from YouTube videos.
Memory-efficient image segmentation via sequential model loading.
Emotion recognition using DINOv2, Wav2Vec2, and DistilBERT for multimodal fusion.
Android app for osteoporosis classification achieving 90% accuracy.
Real-time facial expression recognition using TensorFlow Lite on Android.
Real-time gesture recognition using FastViT achieving 97.5% accuracy.
Document processing and Q&A pipeline using DeepSeek and Llama models.
Checkers game with a Minimax AI opponent with alpha-beta pruning.
Gender detection using InceptionV3 achieving 94.35% accuracy.
05 — Skills
06 — Education
Hyderabad, India · Currently Studying
07 — Writing
The first Vision Mamba for Generalized Medical Image Classification — what it is, how it works, and why it matters.
Medium · 2024
Flam (Flying Flamingoes Pvt. Ltd.) · Bangalore · Jan 2026 – Present
Working in the Avatar team on interactive talking‑head avatars for B2B products that enhance user experience.
Hamad Medical Corporation, Qatar · May 2025 – August 2025
Worked on emergency research within the Department of Surgery at Hamad Medical Corporation, one of the leading academic medical centers in the Middle East.
BITS Pilani, Hyderabad · Aug 2024 – Dec 2025
As a Research Assistant in the Department of ECE under Prof Rajesh Kumar Tripathy at BITS Pilani, I contributed to pioneering research in medical imaging with a focus on deep learning architectures for clinical diagnostics.
IGCAR Kalpakkam · May 2024 – Aug 2024
Research internship at IGCAR, focused on advancing computer vision methods for industrial and scientific inspection applications.
Automated Detection and Classification of Respiratory Diseases Using Chest X-Ray Analysis
Automated Bone Fracture Detection
IEEE Sensors Letters · Vol. 10, Issue 3 · March 2026
Beyond the Transcript: True Multimodal YouTube Intelligence 🎥
U-Tube AI is a revolutionary multimodal AI agent that transforms YouTube videos into comprehensive knowledge assets by analyzing both audio and visual streams. Unlike mainstream tools (NoteGPT, Notta, MyMap.AI) that rely solely on transcripts, U-Tube AI employs adaptive frame sampling and OCR to capture slides, diagrams, and code shown on screen. This research-backed approach achieves 70-90% cost reduction compared to direct VLM processing while maintaining complete visual context that transcript-only tools miss entirely.
----------------------------------------------------------------------------------------------------------------
GitHub Repository: https://github.com/Rishab27279/U-Tube-AI
----------------------------------------------------------------------------------------------------------------
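The adaptive frame sampling mentioned above can be sketched minimally: keep a frame only when it differs noticeably from the last kept frame, so static talking-head stretches cost nothing while slide changes are captured. The `threshold` value here is invented for illustration, not the production setting.

```python
import numpy as np

def sample_keyframes(frames, threshold=12.0):
    """Keep a frame only when it differs enough from the last kept frame.

    frames: iterable of HxW grayscale arrays (uint8 or float).
    threshold: mean absolute pixel difference that counts as "new content".
    """
    kept = []
    last = None
    for idx, frame in enumerate(frames):
        f = frame.astype(np.float32)
        if last is None or np.abs(f - last).mean() > threshold:
            kept.append(idx)
            last = f
    return kept

# Synthetic clip: five identical frames, then a sudden slide change.
static = np.zeros((4, 4), dtype=np.uint8)
slide = np.full((4, 4), 200, dtype=np.uint8)
clip = [static] * 5 + [slide] * 3
print(sample_keyframes(clip))  # → [0, 5]
```

Kept frames would then go to OCR; everything else rides on the transcript alone.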
U-Tube AI addresses the critical limitation of existing AI note-taking tools that are completely blind to visual content. The framework implements a sophisticated multimodal architecture combining:
The system operates through an intelligent dual-stage pipeline optimized for both quality and production-ready efficiency:
U-Tube AI demonstrates significant advantages over existing solutions while maintaining comprehensive understanding:
The framework excels across diverse educational and professional content types, addressing critical gaps in existing tools:
Not every video requires frame-by-frame analysis. U-Tube AI implements a confidence-based routing engine that optimizes resource allocation:
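The routing idea can be sketched as a small decision function; the threshold values and route names below are illustrative assumptions, not the engine's actual configuration.

```python
def route_video(transcript_confidence, visual_density,
                conf_threshold=0.85, density_threshold=0.3):
    """Decide how much visual processing a video needs.

    transcript_confidence: ASR confidence in [0, 1].
    visual_density: estimated fraction of frames carrying slides/diagrams/code.
    Thresholds are illustrative, not production values.
    """
    if transcript_confidence >= conf_threshold and visual_density < density_threshold:
        return "transcript_only"          # talking-head video: transcript suffices
    if visual_density >= density_threshold:
        return "full_visual_pipeline"     # lecture/tutorial: sample frames + OCR
    return "sparse_frame_sampling"        # uncertain: cheap middle path

print(route_video(0.95, 0.05))  # podcast-style video → "transcript_only"
print(route_video(0.90, 0.60))  # slide-heavy lecture → "full_visual_pipeline"
```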
U-Tube AI is my attempt at a paradigm shift in YouTube content analysis: treating visual information as first-class data rather than an afterthought. The framework combines cutting-edge computer vision research with practical cost efficiency, enabling students and professionals to digest complex technical content without losing the critical visual context that drives true understanding.
Making Advanced Image Segmentation Accessible to Everyone 🔍
EdgeSeg-AI is a revolutionary framework that makes advanced image segmentation accessible to everyone by introducing a novel, resource-efficient approach to prompt-based image segmentation. The framework addresses the computational limitations of cutting-edge segmentation methods by sequentially orchestrating three specialized models: Large Language Model (LLM), Fine-tuned VLM and Segment Anything Model (SAM). This innovative architecture achieves a 60-70% reduction in peak memory usage while maintaining high segmentation quality.
----------------------------------------------------------------------------------------------------------------
GitHub Repository: https://github.com/Rishab27279/EdgeSeg-AI
----------------------------------------------------------------------------------------------------------------
EdgeSeg-AI presents a unique architectural approach that sequentially orchestrates three specialized models, representing a fundamental shift from conventional approaches that load all models simultaneously:
The framework operates through a carefully designed three-stage pipeline that maximizes efficiency and accessibility:
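The one-model-at-a-time idea behind the memory savings can be illustrated with a toy memory-accounting sketch; the model sizes below are invented placeholders, not the real LLM/VLM/SAM footprints.

```python
import gc

# Illustrative sizes in GB; real LLM/VLM/SAM footprints vary by checkpoint.
MODEL_SIZES = {"llm": 6.0, "vlm": 4.0, "sam": 2.5}

def peak_memory_sequential(stages):
    """Load one model at a time, free it, and track the peak footprint."""
    peak = 0.0
    for name in stages:
        model = {"name": name, "size": MODEL_SIZES[name]}  # stand-in for loading weights
        peak = max(peak, model["size"])
        del model        # drop the reference to the finished stage ...
        gc.collect()     # ... and let the runtime reclaim it before the next load
    return peak

def peak_memory_simultaneous(stages):
    """Conventional approach: all models resident at once."""
    return sum(MODEL_SIZES[s] for s in stages)

pipeline = ["llm", "vlm", "sam"]
print(peak_memory_sequential(pipeline))    # 6.0 — only the largest model at a time
print(peak_memory_simultaneous(pipeline))  # 12.5 — everything at once
```

With real frameworks the `del`/`gc.collect()` step would also release GPU memory (e.g. via the framework's cache-clearing call) before the next stage loads.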
EdgeSeg-AI demonstrates significant improvements in resource efficiency while maintaining segmentation quality:
The framework demonstrates robust performance across diverse contexts, showcasing the potential for democratizing AI-powered image analysis:
This work builds upon the foundational contributions of the LLM-Seg paper by Junchi Wang and Lei Ke from ETH Zurich, adapting their approach to prioritize computational efficiency and accessibility. The research community's insights are invaluable for advancing this work further in the following areas:
EdgeSeg-AI represents a significant advancement in making sophisticated image segmentation technology accessible to a broader audience, combining cutting-edge AI research with practical resource efficiency. The framework's innovative sequential model loading approach opens new possibilities for deploying advanced computer vision capabilities on standard consumer hardware, democratizing access to powerful AI tools across various domains and applications.
🎭 Multimodal Emotion Recognition Engine 🚀
Moody.AI is a cutting-edge multimodal AI system that analyzes emotions from video content using computer vision, audio processing, and natural language processing. Powered by state-of-the-art deep learning models including DINOv2, Wav2Vec2, DistilBERT, and Whisper, the system provides comprehensive sentiment analysis with an intuitive web interface and achieves 61% accuracy on the challenging MELD dataset.
----------------------------------------------------------------------------------------------------------------
Docker Hub: https://hub.docker.com/r/rishab27279/moody-ai | GitHub: https://github.com/Rishab27279/MoodyAI
----------------------------------------------------------------------------------------------------------------
Moody.AI employs a sophisticated trimodal fusion architecture combining multiple AI models:
The breakthrough architecture combines all three modalities through advanced fusion techniques:
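A minimal late-fusion sketch of the trimodal idea: concatenate per-modality embeddings and apply a shared classification head. The 16-dimensional embeddings and random weights below are stand-ins (the real DINOv2/Wav2Vec2/DistilBERT embeddings are far larger); only the seven-class MELD label set is taken from the project description.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings; real per-modality dims are on the order of hundreds.
vision = rng.standard_normal(16)   # DINOv2-style visual embedding
audio = rng.standard_normal(16)    # Wav2Vec2-style audio embedding
text = rng.standard_normal(16)     # DistilBERT embedding of the Whisper transcript

def late_fusion_logits(vision, audio, text, weight, bias):
    """Concatenate per-modality embeddings and apply a linear emotion head."""
    fused = np.concatenate([vision, audio, text])  # shape (48,)
    return fused @ weight + bias                   # shape (num_emotions,)

num_emotions = 7  # MELD's emotion label set
W = rng.standard_normal((48, num_emotions)) * 0.1  # untrained stand-in weights
b = np.zeros(num_emotions)

logits = late_fusion_logits(vision, audio, text, W, b)
probs = np.exp(logits) / np.exp(logits).sum()      # softmax over emotions
print(probs.shape)  # (7,)
```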
This project demonstrates the advancement of multimodal AI on edge devices, combining computer vision, natural language processing, and audio analysis into a comprehensive emotion recognition system with state-of-the-art performance.
🦴 Next-Gen Bone Diagnostics: AI-Powered, Mobile-First, Research-Driven, Powered by Deep Learning 📱
OsteoDiagnosis.AI is an innovative Android application that leverages advanced signal processing and deep learning techniques for automated bone health assessment. The app utilizes a novel, lightweight neural network architecture combining signal processing (Fourier analysis) and deep learning to classify bone density into three categories: Osteoporosis, Osteopenia, and Normal bone density. This research-driven project represents a significant advancement in mobile healthcare AI, combining cutting-edge computer vision with clinical diagnostic applications.
----------------------------------------------------------------------------------------------------------------
APK: https://github.com/Rishab27279/OS_Detection_Binary_And_3_Class_DWT/releases/download/v1.0/app-debug.apk
----------------------------------------------------------------------------------------------------------------
This project was developed under academic supervision as part of ongoing research in medical AI diagnostics at BITS Pilani Hyderabad. While the complete technical methodology and results are currently confidential, as the research paper is in its final stages, the application demonstrates the successful integration of signal processing techniques with modern deep learning architectures for bone health assessment.
The core innovation lies in the fusion of traditional signal processing methodologies with state-of-the-art deep learning approaches:
Due to the ongoing nature of this research and pending publication, specific technical details regarding the model architecture, training protocols, and validation datasets remain confidential. The methodology represents a novel contribution to the field of medical AI, particularly in bone health diagnostics.
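While the actual architecture remains confidential, the general pattern of feeding Fourier-domain features into a classifier can be illustrated generically. This is NOT the published method: the `keep` parameter and the low-frequency feature layout are invented purely for illustration.

```python
import numpy as np

def fourier_features(image, keep=8):
    """Generic 2-D FFT feature extractor: low-frequency log-magnitude block.

    image: HxW grayscale array (e.g. a knee X-ray crop).
    Returns a flattened keep x keep block of the centered spectrum, where
    coarse texture frequencies live; a downstream network would classify it.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(image.astype(np.float32)))
    mag = np.log1p(np.abs(spectrum))           # compress the dynamic range
    h, w = mag.shape
    cy, cx = h // 2, w // 2                    # spectrum center after fftshift
    block = mag[cy - keep // 2: cy + keep // 2,
                cx - keep // 2: cx + keep // 2]
    return block.ravel()

x = np.random.default_rng(1).random((64, 64))  # stand-in for an X-ray patch
feats = fourier_features(x)
print(feats.shape)  # (64,)
```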
This project demonstrates the successful application of advanced AI techniques to critical healthcare challenges, showcasing the potential for mobile-deployed deep learning solutions in clinical diagnostics. The combination of academic rigor with practical implementation highlights the intersection of research innovation and real-world healthcare applications.
🎭 Feel the Mood. Frame by Frame. Powered by Deep Learning. 🎭
Expression.AI is a real-time facial expression recognition Android application that uses deep learning to detect and classify human emotions. Powered by a custom-made TensorFlow Lite model named ResInceptionCNN, the app is designed for fast on-device inference and an intuitive user experience. It combines the power of computer vision with emotion AI for mobile devices.
App APK -> https://github.com/Riiishaab/Expression.AI/releases/download/v1.0/ExpressionAI.apk
Expression.AI is built on a custom deep learning model trained for facial expression recognition:
The core of the project is the ResInceptionCNN model:
This project demonstrates the potential of deep learning on edge devices, blending AI, mobile development, and human emotion understanding into a seamless Android experience.
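The on-device inference loop around such a TensorFlow Lite model can be sketched as preprocessing plus logit decoding; the 48×48 input size and the emotion label order below are typical-FER assumptions, not the app's actual configuration.

```python
import numpy as np

# Illustrative label order; the real ResInceptionCNN output mapping may differ.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

def preprocess(frame, size=48):
    """Resize (nearest-neighbour, no external deps) and normalize a grayscale frame."""
    f = frame.astype(np.float32)
    ys = np.arange(size) * f.shape[0] // size
    xs = np.arange(size) * f.shape[1] // size
    return (f[np.ix_(ys, xs)] / 255.0)[None, ..., None]  # shape (1, size, size, 1)

def decode(logits):
    """Softmax the model output and return the top emotion label."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return EMOTIONS[int(probs.argmax())]

# On Android, the tensor would flow through tf.lite.Interpreter:
#   interpreter.set_tensor(input_idx, preprocess(frame))
#   interpreter.invoke()
#   label = decode(interpreter.get_tensor(output_idx)[0])
batch = preprocess(np.zeros((120, 160), dtype=np.uint8))
print(batch.shape)  # (1, 48, 48, 1)
print(decode(np.array([0.1, 0.0, 0.0, 2.0, 0.5, 0.0, 0.0])))  # "happy"
```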
🤟🏼 Real-time gesture recognition through FastViT Model ✌️
This project implements real-time hand gesture recognition using Apple's FastViT architecture, leveraging transfer learning on the HaGRID dataset to achieve 97.5% accuracy while maintaining efficiency for real-time applications.
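The transfer-learning recipe (freeze the pretrained features, train only a new gesture head) can be sketched as follows. The tiny convolutional module is a stand-in for the real `fastvit_t8.apple_in1k` backbone, which would be loaded via `timm`; the 18-class head matches HaGRID's gesture set.

```python
import torch
import torch.nn as nn

# Stand-in backbone; the real project would instead use
#   timm.create_model("fastvit_t8.apple_in1k", pretrained=True, num_classes=0)
backbone = nn.Sequential(
    nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
feat_dim, num_gestures = 8, 18  # HaGRID defines 18 gesture classes

# Transfer learning: freeze the pretrained feature extractor ...
for p in backbone.parameters():
    p.requires_grad = False
# ... and train only the new classification head.
head = nn.Linear(feat_dim, num_gestures)

model = nn.Sequential(backbone, head)
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 18])
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)     # only the head's 8*18 + 18 = 162 parameters update
```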
The core intelligence is provided by a hybrid vision transformer:
fastvit_t8.apple_in1k pretrained model
FastViT was chosen for its efficiency advantages over ConvNeXT, offering high accuracy in resource-constrained environments.
Trained on the Hand Gesture Recognition Image Dataset (HaGRID) 150k subset:
Option 1: Google Colab
Option 2: Inference with Pretrained Model
sign_lang_model.pkl)
🚀 Intelligent Document Analysis and Summarization 🚀
Developed an AI-driven document processing system combining DeepSeek R1-1.5B for structured data extraction and Llama-7B for summarization. Achieved 92% accuracy in entity extraction and 88% ROUGE-L summarization scores, enabling efficient processing of legal and technical documents.
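The two-stage pipeline contract (extraction model first, summarization model second) can be sketched with stand-in functions; the regex extractor and truncating summarizer below merely substitute for the DeepSeek R1-1.5B and Llama-7B calls, and every name here is illustrative.

```python
import re

def extract_entities(text):
    """Stand-in for the DeepSeek R1-1.5B extraction stage: pull dated clauses.

    A real deployment would prompt the LLM for structured JSON; this regex
    stub only illustrates the pipeline contract (text in, records out).
    """
    return [{"date": d} for d in re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)]

def summarize(text, max_words=12):
    """Stand-in for the Llama-7B summarization stage: naive truncation."""
    words = text.split()
    return " ".join(words[:max_words]) + ("…" if len(words) > max_words else "")

def process_document(text):
    """Two-stage pipeline: structured extraction first, then summarization."""
    return {"entities": extract_entities(text), "summary": summarize(text)}

doc = "The agreement dated 2024-03-15 supersedes the draft of 2023-11-02."
result = process_document(doc)
print(result["entities"])  # [{'date': '2024-03-15'}, {'date': '2023-11-02'}]
```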
🎮 Where Strategic Intelligence Conquers the Classic Game 🎮
This code implements a comprehensive Checkers (Draughts) game featuring two AI players with contrasting strategies: a sophisticated Smart AI using the Minimax algorithm with alpha-beta pruning and a baseline Random AI that makes random legal moves.
The Minimax algorithm implementation represents the core intelligence of the Smart AI:
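A minimal generic minimax with alpha-beta pruning over an abstract game tree looks like this; the checkers-specific move generation and board evaluation are abstracted into the `children` and `evaluate` callables, and the toy tree at the bottom is purely illustrative.

```python
def minimax(state, depth, alpha, beta, maximizing, children, evaluate):
    """Minimax with alpha-beta pruning.

    children(state) yields successor states; evaluate(state) scores leaves
    from the maximizing player's perspective.
    """
    succ = children(state)
    if depth == 0 or not succ:
        return evaluate(state)
    if maximizing:
        best = float("-inf")
        for child in succ:
            best = max(best, minimax(child, depth - 1, alpha, beta, False,
                                     children, evaluate))
            alpha = max(alpha, best)
            if beta <= alpha:   # opponent will never allow this branch: prune
                break
        return best
    best = float("inf")
    for child in succ:
        best = min(best, minimax(child, depth - 1, alpha, beta, True,
                                 children, evaluate))
        beta = min(beta, best)
        if beta <= alpha:
            break
    return best

# Toy two-ply tree; leaves carry their own scores.
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
scores = {"a1": 3, "a2": 5, "b1": 2, "b2": 9}
val = minimax("root", 2, float("-inf"), float("inf"), True,
              lambda s: tree.get(s, []), lambda s: scores.get(s, 0))
print(val)  # → 3 (maximizer picks branch "a": min(3, 5) beats min(2, 9))
```

Note that branch "b2" is never evaluated: once "b1" scores 2, below branch "a"'s guaranteed 3, the cutoff fires.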
The Random AI provides a contrasting approach:
This implementation demonstrates advanced concepts in game AI development.
🔍 Advanced Facial Analysis System 🔍
Implemented a gender detection system using hybrid CNN-InceptionV3 architecture, achieving 94.35% accuracy on the CelebA dataset. Features include dynamic augmentation, adaptive learning rate scheduling, and quantized TensorFlow Lite deployment.
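The adaptive learning-rate scheduling mentioned above follows the familiar reduce-on-plateau pattern, sketched here framework-free; the starting rate, factor, and patience values are illustrative, not the training configuration actually used.

```python
def plateau_scheduler(val_losses, lr=1e-3, factor=0.5, patience=2, min_lr=1e-6):
    """Adaptive learning-rate schedule: cut the LR when validation loss
    stops improving for more than `patience` epochs (reduce-on-plateau logic)."""
    history, best, wait = [], float("inf"), 0
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0       # new best: reset the patience counter
        else:
            wait += 1
            if wait > patience:        # plateau persisted: reduce the LR
                lr, wait = max(lr * factor, min_lr), 0
        history.append(lr)
    return history

# Loss improves for three epochs, then plateaus: the LR is halved once the
# plateau outlasts the patience window.
lrs = plateau_scheduler([0.9, 0.7, 0.6, 0.6, 0.6, 0.6])
print(lrs)  # → [0.001, 0.001, 0.001, 0.001, 0.001, 0.0005]
```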