Research

Work grounded in what we've built and honest about what remains open.

Demonstrated Contributions

01

Multilingual Retrieval for Low-Resource Languages

Built and deployed a cross-lingual retrieval system supporting French, Haitian Kreyòl, English, and Spanish across 228,000+ pages of historical documents. Kreyòl is an extremely low-resource language with minimal NLP tooling — the Rasin.ai pipeline is among the first production systems to handle Kreyòl at this scale.

BGE-M3 · Cross-lingual embeddings · Qdrant vector store · Hybrid retrieval

Open Questions

  • Quantitative benchmarks for Kreyòl retrieval quality (in progress)
  • Cross-lingual query performance when query and document languages differ
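
One common way to build hybrid retrieval is to fuse a dense (embedding) ranking with a sparse (lexical) ranking. The sketch below uses reciprocal rank fusion, which is an assumption for illustration: the source does not say how the Rasin.ai pipeline combines its result lists. The document IDs and the constant `k=60` are hypothetical placeholders.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by more than one retriever rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a dense and a sparse retriever.
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Rank-based fusion avoids having to calibrate dense and sparse scores against each other, which can matter when query and document languages differ and raw similarity scores are not directly comparable.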

02

GPU-Accelerated OCR for Historical Archives

A seven-stage pipeline processes handwritten records, colonial-era newspapers, and legal codes from the 16th through the 20th centuries. The pipeline runs on a single self-hosted NVIDIA DGX Spark at approximately $50/month in operating costs, demonstrating that archival-scale digitization does not require enterprise cloud spend.

docTR · NVIDIA DGX Spark · GPU-accelerated batch processing · vLLM

Open Questions

  • Character error rate (CER) and word error rate (WER) benchmarks by language and century
  • Performance on severely degraded or water-damaged documents
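
For reference, CER and WER are both edit distance divided by reference length, computed over characters and whitespace-separated words respectively. The following is a minimal sketch of those definitions, not the project's evaluation code. The function names are hypothetical, and the NFC normalization step is an assumption added so that composed and decomposed accented characters (common in French and Kreyòl text) compare as equal.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference length."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference, hypothesis):
    """Word error rate: word edits / number of reference words."""
    ref_words = unicodedata.normalize("NFC", reference).split()
    hyp_words = unicodedata.normalize("NFC", hypothesis).split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

Reporting these by language and century, as the open question proposes, would mean grouping ground-truth pages by those attributes and computing each rate per group.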

03

Knowledge Graph Construction from Unstructured Historical Text

Entity extraction and deduplication across 43+ archival collections produce a knowledge graph connecting 20,000+ people, places, and events across five centuries of Haitian history. The graph enables relationship traversal that keyword and vector search alone cannot provide.

GLiNER · Neo4j · Entity deduplication · Named entity recognition across 4 languages

Open Questions

  • Formal evaluation of entity extraction precision and recall by type
  • Deduplication methodology documentation and error rate analysis
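
A formal evaluation by entity type could start from exact-match precision and recall against hand-annotated gold entities. The sketch below assumes entities are represented as `(text, type)` pairs and that only exact matches count. Both are simplifying assumptions, and the function name and sample annotations are hypothetical rather than taken from the project's data.

```python
from collections import Counter


def precision_recall_by_type(gold, predicted):
    """Return {entity_type: (precision, recall)} for exact-match entities.

    gold and predicted are sets of (text, entity_type) tuples.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for entity in predicted:
        if entity in gold:
            tp[entity[1]] += 1   # correctly extracted
        else:
            fp[entity[1]] += 1   # extracted but not in gold
    for entity in gold:
        if entity not in predicted:
            fn[entity[1]] += 1   # in gold but missed

    results = {}
    for etype in set(tp) | set(fp) | set(fn):
        p_den = tp[etype] + fp[etype]
        r_den = tp[etype] + fn[etype]
        precision = tp[etype] / p_den if p_den else 0.0
        recall = tp[etype] / r_den if r_den else 0.0
        results[etype] = (precision, recall)
    return results


# Hypothetical annotations for illustration only.
gold = {("Toussaint Louverture", "PERSON"),
        ("Jean-Jacques Dessalines", "PERSON"),
        ("Cap-Français", "PLACE")}
predicted = {("Toussaint Louverture", "PERSON"),
             ("Saint-Domingue", "PERSON"),
             ("Cap-Français", "PLACE")}
scores = precision_recall_by_type(gold, predicted)
```

Exact matching is strict: a span that differs by a title or spelling variant counts as both a false positive and a false negative, so a real evaluation might add a relaxed-match variant alongside it.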

Active Directions

04 · In Progress

Sovereign Infrastructure for Community AI

Rasin.ai runs entirely on self-hosted infrastructure — no cloud dependency, no third-party data handling. The Joumou project extends this model to community-owned platforms where users hold governance rights. The research question is not just technical: what does it take for a community to actually own and sustain its own AI infrastructure?

On-premise GPU compute · vLLM · Local-first architecture · Cooperative ownership models

05 · In Progress

Evaluation Methods for Historical NLP

Standard NLP benchmarks don't exist for Haitian colonial documents. Building ground-truth evaluation datasets for OCR accuracy, entity extraction quality, and retrieval performance on these materials is itself a contribution to the field. This work is a prerequisite for the planned technical publication.

CER / WER evaluation · Human review protocols · Retrieval quality assessment · Ablation studies
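
Once ground-truth relevance judgments exist, retrieval quality can be summarized with standard rank metrics. The sketch below shows recall@k and reciprocal rank for a single query. The choice of these two metrics is an assumption, since the source does not name the ones planned, and the document IDs are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must be non-empty")
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical system output and human relevance judgments for one query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
r_at_3 = recall_at_k(ranked, relevant, 3)
rr = reciprocal_rank(ranked, relevant)
```

Averaging reciprocal rank over a query set gives mean reciprocal rank (MRR). Running the same query set with one pipeline component disabled at a time is one way the ablation studies listed above could be structured.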