Research
Work grounded in what we've built and honest about what remains open.
Demonstrated Contributions
Multilingual Retrieval for Low-Resource Languages
Built and deployed a cross-lingual retrieval system supporting French, Haitian Kreyòl, English, and Spanish across 228,000+ pages of historical documents. Kreyòl is an extremely low-resource language with minimal NLP tooling, and the Rasin.ai pipeline is among the first production systems to handle it at this scale.
BGE-M3 · Cross-lingual embeddings · Qdrant vector store · Hybrid retrieval
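As an illustration of what hybrid retrieval involves, the sketch below fuses a dense and a sparse ranking with reciprocal rank fusion (RRF). RRF is an assumption chosen for the example; the fusion method the pipeline uses is not stated here. The page IDs and the `k=60` constant are likewise hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical page IDs returned by a dense and a sparse retriever.
dense_hits = ["page_12", "page_07", "page_33"]
sparse_hits = ["page_07", "page_90", "page_12"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Pages that rank well in both lists rise to the top, while a page found by only one retriever still appears lower in the fused ranking.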
Open Questions
- Quantitative benchmarks for Kreyòl retrieval quality (in progress)
- Cross-lingual query performance when query and document languages differ
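One way to approach both questions is to report recall@k separately for each (query language, document language) pair. The sketch below is a minimal version under assumed inputs: the record layout, the field names, and the sample data are hypothetical, not the project's evaluation format.

```python
from collections import defaultdict


def recall_at_k_by_language_pair(queries, k=10):
    """Mean recall@k grouped by (query language, document language)."""
    per_pair = defaultdict(list)
    for q in queries:
        relevant = set(q["relevant_ids"])
        if not relevant:
            continue  # Recall is undefined without judged relevant documents.
        hits = len(relevant & set(q["retrieved_ids"][:k]))
        per_pair[(q["query_lang"], q["doc_lang"])].append(hits / len(relevant))
    return {pair: sum(vals) / len(vals) for pair, vals in per_pair.items()}


# Hypothetical judged queries; language codes follow ISO 639-1 (ht = Haitian Kreyòl).
sample = [
    {"query_lang": "ht", "doc_lang": "fr",
     "relevant_ids": ["a", "b"], "retrieved_ids": ["a", "x", "b"]},
    {"query_lang": "ht", "doc_lang": "fr",
     "relevant_ids": ["c"], "retrieved_ids": ["y", "z", "c"]},
    {"query_lang": "en", "doc_lang": "ht",
     "relevant_ids": ["d"], "retrieved_ids": ["d"]},
]

report = recall_at_k_by_language_pair(sample, k=2)
print(report)
```

Breaking the score out by language pair makes it visible when, for example, Kreyòl queries against French documents lag behind same-language retrieval.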
GPU-Accelerated OCR for Historical Archives
A seven-stage pipeline processes handwritten records, colonial-era newspapers, and legal codes from the 16th through the 20th centuries. The pipeline runs on a single self-hosted NVIDIA DGX Spark at approximately $50/month in operating costs, demonstrating that archival-scale digitization does not require enterprise cloud spend.
docTR · NVIDIA DGX Spark · GPU-accelerated batch processing · vLLM
Open Questions
- Character error rate (CER) and word error rate (WER) benchmarks by language and century
- Performance on severely degraded or water-damaged documents
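For reference, CER and WER are both edit distances normalized by the length of the ground-truth transcription, counted in characters and words respectively. The sketch below is a minimal, dependency-free version; the NFC normalization step and the whitespace tokenization are assumptions, and a real benchmark would need to fix its own conventions for accents, punctuation, and historical spelling.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference length."""
    # Normalize so composed and decomposed accents compare as equal.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference, hypothesis):
    """Word error rate: word edits / number of reference words."""
    ref_words = unicodedata.normalize("NFC", reference).split()
    hyp_words = unicodedata.normalize("NFC", hypothesis).split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


print(cer("abcd", "abed"))
print(wer("la ville du Cap", "la vile du Cap"))
```

Both rates can exceed 1.0 when the OCR output contains many insertions, which is worth keeping in mind when aggregating scores by language and century.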
Knowledge Graph Construction from Unstructured Historical Text
Entity extraction and deduplication across 43+ archival collections produce a knowledge graph connecting 20,000+ people, places, and events across five centuries of Haitian history. The graph enables relationship traversal that keyword and vector search alone cannot provide.
GLiNER · Neo4j · Entity deduplication · Named entity recognition across 4 languages
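To make "relationship traversal" concrete, the sketch below finds the shortest chain of links between two entities with a breadth-first search over an in-memory adjacency map. It is a toy stand-in for a graph database query, not the Neo4j queries the project runs, and the node names are placeholders rather than real archive entities.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    if start == goal:
        return [start]
    visited = {start}
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        for neighbor in graph.get(path[-1], ()):
            if neighbor in visited:
                continue
            if neighbor == goal:
                return path + [neighbor]
            visited.add(neighbor)
            queue.append(path + [neighbor])
    return None


# Placeholder graph: two people connected only through an event and a place.
graph = {
    "person_A": ["event_1"],
    "event_1": ["person_A", "place_X"],
    "place_X": ["event_1", "person_B"],
    "person_B": ["place_X"],
}

path = shortest_path(graph, "person_A", "person_B")
print(path)
```

Neither a keyword nor a vector query would surface this connection unless a single document mentioned both people; the path exists only through the intermediate entities.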
Open Questions
- Formal evaluation of entity extraction precision and recall by type
- Deduplication methodology documentation and error rate analysis
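A per-type evaluation could be sketched as follows, assuming exact-match scoring: a predicted entity counts as correct only if its document, character span, and type all match a gold annotation. The tuple layout, the type labels, and the sample annotations are hypothetical, and a real protocol might also credit partial span overlap.

```python
from collections import Counter


def precision_recall_by_type(gold, predicted):
    """Exact-match precision and recall per entity type.

    gold, predicted: sets of (doc_id, start, end, entity_type) tuples.
    """
    true_pos = Counter(e[3] for e in gold & predicted)
    n_gold = Counter(e[3] for e in gold)
    n_pred = Counter(e[3] for e in predicted)
    report = {}
    for etype in sorted(set(n_gold) | set(n_pred)):
        # Report 0.0 when a denominator is empty rather than dividing by zero.
        precision = true_pos[etype] / n_pred[etype] if n_pred[etype] else 0.0
        recall = true_pos[etype] / n_gold[etype] if n_gold[etype] else 0.0
        report[etype] = {"precision": precision, "recall": recall}
    return report


# Hypothetical annotations: one PLACE span is mislabeled as PERSON.
gold = {("doc1", 0, 9, "PERSON"), ("doc1", 20, 31, "PLACE"), ("doc2", 5, 12, "PERSON")}
pred = {("doc1", 0, 9, "PERSON"), ("doc1", 20, 31, "PERSON"), ("doc2", 5, 12, "PERSON")}

report = precision_recall_by_type(gold, pred)
print(report)
```

In this sample the mislabeled span lowers PERSON precision and drops PLACE recall to zero, which is the kind of type-level confusion an aggregate score would hide.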
Active Directions
Sovereign Infrastructure for Community AI
Rasin.ai runs entirely on self-hosted infrastructure — no cloud dependency, no third-party data handling. The Joumou project extends this model to community-owned platforms where users hold governance rights. The research question is not just technical: what does it take for a community to actually own and sustain its own AI infrastructure?
On-premise GPU compute · vLLM · Local-first architecture · Cooperative ownership models
Evaluation Methods for Historical NLP
Standard NLP benchmarks don't exist for Haitian colonial documents. Building ground-truth evaluation datasets for OCR accuracy, entity extraction quality, and retrieval performance on these materials is itself a contribution to the field. This work is a prerequisite for a planned technical publication.
CER / WER evaluation · Human review protocols · Retrieval quality assessment · Ablation studies