# LexiMind Architecture

## Overview
LexiMind couples a from-scratch Transformer implementation with a modern data and inference stack. The project consists of three major layers:
- Data & Preprocessing – lightweight text cleaning built on top of scikit-learn primitives, plus a Hugging Face tokenizer wrapper with deterministic batching helpers.
- Model Composition – the bespoke encoder/decoder stack with task heads assembled via `MultiTaskModel`, plus `models.factory.build_multitask_model` to rebuild the network from configuration files.
- Inference & Serving – a multi-task pipeline capable of summarization, emotion, and topic classification, surfaced through a CLI and a FastAPI service, with plans for a Gradio UI.
## Custom Transformer Stack
- `src/models/encoder.py` and `src/models/decoder.py` implement Pre-LayerNorm Transformer blocks with explicit positional encoding, masking logic, and incremental decoding support.
- `src/models/heads.py` provides modular output heads. Summarization uses an `LMHead` tied to the decoder embedding weights; emotion and topic tasks use `ClassificationHead` instances.
- `src/models/multitask.py` routes inputs to the correct head, computes task-specific losses, and exposes a single forward API used by the trainer and inference pipeline.
- `src/models/factory.py` rebuilds the encoder, decoder, and heads directly from the YAML config and tokenizer metadata, so inference reconstructs the exact architecture used in training.
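To make the Pre-LayerNorm convention concrete, here is a minimal, illustrative encoder block in PyTorch. It is a sketch only: the class name, argument names, and the use of `nn.MultiheadAttention` are assumptions and do not mirror the actual code in `src/models/encoder.py`.

```python
from typing import Optional

import torch
from torch import nn


class PreLNEncoderBlock(nn.Module):
    """Illustrative Pre-LayerNorm encoder block: normalize -> sublayer -> residual."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff_norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, pad_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # Pre-LN: layer norm is applied before each sublayer, and the residual
        # is added back to the unnormalized stream.
        h = self.attn_norm(x)
        h, _ = self.attn(h, h, h, key_padding_mask=pad_mask, need_weights=False)
        x = x + self.dropout(h)

        h = self.ff_norm(x)
        x = x + self.dropout(self.ff(h))
        return x


# Quick shape check: batch of 2 sequences, 16 tokens, model dimension 64.
block = PreLNEncoderBlock(d_model=64, n_heads=4, d_ff=256)
print(block(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```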
## Data, Tokenization, and Preprocessing
- `src/data/tokenization.py` wraps `AutoTokenizer` to provide tensor-aware batching and helper utilities for decoder input shifting, BOS/EOS resolution, and vocab size retrieval.
- `src/data/preprocessing.py` introduces `TextPreprocessor`, which layers a `BasicTextCleaner` with optional scikit-learn transformers (via `sklearn_transformer`) before tokenization. This keeps the default cleaning minimal while allowing future reuse of `sklearn.preprocessing` utilities without changing calling code.
- `src/data/dataset.py` and `src/data/dataloader.py` define strongly typed dataset containers and collators that encode inputs with the shared tokenizer and set up task-specific labels (multi-label emotions, categorical topics, seq2seq summaries).
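As an illustration of the cleaning-then-tokenizing layering described above (not the actual `TextPreprocessor` or `BasicTextCleaner` code), a scikit-learn `FunctionTransformer` can be slotted in front of a Hugging Face tokenizer roughly like this; the checkpoint name and helper functions are placeholders:

```python
from sklearn.preprocessing import FunctionTransformer
from transformers import AutoTokenizer


def basic_clean(texts: list[str]) -> list[str]:
    """Minimal cleaning in the spirit of a basic text cleaner: trim and collapse whitespace."""
    return [" ".join(t.split()) for t in texts]


# Optional scikit-learn stage; any TransformerMixin exposing .transform() could be injected.
lowercaser = FunctionTransformer(lambda texts: [t.lower() for t in texts])

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint


def preprocess(texts: list[str]):
    cleaned = basic_clean(texts)
    cleaned = lowercaser.transform(cleaned)  # injected scikit-learn transformer
    return tokenizer(cleaned, padding=True, truncation=True, return_tensors="pt")


batch = preprocess(["  LexiMind  summarizes   long documents. "])
print(batch["input_ids"].shape)
```

The point of the layering is that the tokenization and batching code never changes when extra normalization is swapped in or out.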
## Training Pipeline
- `src/training/trainer.py` coordinates multi-task optimization with per-task loss functions, gradient clipping, and shared tokenizer decoding for metric computation.
- Metrics in `src/training/metrics.py` include accuracy, multi-label F1, and a ROUGE-like overlap score for summarization (a sketch follows this list). These metrics mirror the trainer outputs logged per task.
- Label vocabularies are serialized to `artifacts/labels.json` after training so inference can decode class indices consistently.
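The exact summarization metric is defined in `src/training/metrics.py`; as a rough illustration of what a ROUGE-like unigram-overlap score computes, a standalone sketch could look like this (hypothetical helper, not the project's implementation):

```python
from collections import Counter


def unigram_overlap_f1(prediction: str, reference: str) -> float:
    """ROUGE-1-style score: F1 over unigram counts shared by prediction and reference."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped shared counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(unigram_overlap_f1("the model summarizes text", "the model writes a summary of the text"))
```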
## Inference & Serving
- `src/inference/pipeline.py` exposes summarization, emotion, and topic predictions with shared pre-processing, generation, and thresholding logic. It expects label vocabularies from the serialized metadata file.
- `src/inference/factory.py` rebuilds the full pipeline by loading the tokenizer (preferring the exported tokenizer artifact), reconstructing the model via the factory helpers, restoring checkpoints, and injecting label metadata.
- The CLI (`scripts/inference.py`) drives the pipeline from the command line. The FastAPI app (`src/api/routes.py`) exposes the `/summarize` endpoint that returns summaries, emotion labels + scores, and topic predictions. Test coverage in `tests/test_inference` and `tests/test_api` validates both layers with lightweight stubs.
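Assuming the service runs locally on the default port and accepts a JSON body with a `text` field (the request and response schemas are assumptions here; the authoritative definitions live in `src/api/routes.py`), a call to `/summarize` might look like:

```python
import requests

# Assumed request shape; check src/api/routes.py for the actual schema.
payload = {
    "text": "LexiMind couples a from-scratch Transformer with a multi-task inference stack."
}

response = requests.post("http://localhost:8000/summarize", json=payload, timeout=30)
response.raise_for_status()

# Expected to contain the summary, emotion labels with scores, and topic predictions.
print(response.json())
```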
## Gradio UI Roadmap
- The inference pipeline returns structured outputs that are already suitable for a web UI.
- Planned steps for a Gradio demo:
  - Wrap `InferencePipeline.batch_predict` inside Gradio callbacks for text input (a sketch follows this list).
  - Display summaries alongside emotion tag chips and topic confidence bars.
  - Surface token-level attention visualizations by extending the pipeline to emit decoder attention maps (hooks already exist in the decoder).
- Documentation and code paths were structured to keep the Gradio integration isolated in a future `src/ui/gradio_app.py` module without altering core logic.
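A minimal sketch of what a future `src/ui/gradio_app.py` could look like. The stub `predict` function and the assumed output shapes (a summary string plus label-to-score dictionaries) stand in for the real `InferencePipeline.batch_predict` call:

```python
import gradio as gr


# Stub standing in for the real pipeline; the actual app would build the pipeline
# via the inference factory and call InferencePipeline.batch_predict instead.
def predict(text: str):
    summary = "(summary would appear here)"
    emotions = {"joy": 0.82, "anger": 0.06}          # assumed label -> score shape
    topics = {"technology": 0.71, "science": 0.18}   # assumed label -> score shape
    return summary, emotions, topics


demo = gr.Interface(
    fn=predict,
    inputs=gr.Textbox(lines=8, label="Input text"),
    outputs=[
        gr.Textbox(label="Summary"),
        gr.Label(label="Emotions"),  # renders label/score pairs as confidence bars
        gr.Label(label="Topics"),
    ],
    title="LexiMind demo",
)

if __name__ == "__main__":
    demo.launch()
```

Using `gr.Label` for the emotion and topic outputs gives the confidence-bar display mentioned in the roadmap without extra plotting code.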
## Key Decisions
- Custom Transformer Preservation – all modeling remains on the bespoke encoder/decoder, satisfying the constraint to avoid Hugging Face model classes while still leveraging their tokenizer implementation.
- Tokenizer Artifact Preference – inference automatically favors the exported tokenizer in `artifacts/hf_tokenizer`, guaranteeing consistent vocabularies between training and serving (see the loading sketch below).
- Sklearn-friendly Preprocessing – the text preprocessor now accepts an optional `TransformerMixin`, so additional normalization (lemmatization, custom token filters, etc.) can be injected with familiar scikit-learn tooling without rewriting the batching code.
- Documentation Alignment – the `docs/` folder mirrors the requested structure, capturing design reasoning and paving the way for future diagrams in `docs/images`.
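A sketch of the tokenizer artifact preference described above; the fallback checkpoint name is a placeholder, and the real resolution logic lives in `src/inference/factory.py`:

```python
from pathlib import Path

from transformers import AutoTokenizer, PreTrainedTokenizerBase


def load_tokenizer(
    artifact_dir: str = "artifacts/hf_tokenizer",
    fallback: str = "bert-base-uncased",  # placeholder; the project's base checkpoint may differ
) -> PreTrainedTokenizerBase:
    """Prefer the tokenizer exported at training time; otherwise fall back to a hub checkpoint."""
    if Path(artifact_dir).exists():
        return AutoTokenizer.from_pretrained(artifact_dir)
    return AutoTokenizer.from_pretrained(fallback)


tokenizer = load_tokenizer()
print(tokenizer.vocab_size)
```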