VersoVector is an emotional-semantic NLP and MLOps project for poetry and lyrical language analysis.
The project starts with poetry as a dense form of emotional and symbolic language, using César Vallejo and English poetry corpora as a key reference point. Its long-term goal is to evolve into a mood-aware recommendation engine capable of mapping poems, lyric-like text, and user-provided fragments into interpretable affective spaces.
VersoVector combines:
- supervised multilabel tag prediction;
- semantic similarity;
- topic modeling;
- clustering;
- visual interpretation;
- model packaging;
- inference through a reusable Python package;
- FastAPI serving;
- a Gradio frontend product experience;
- Docker-based local services;
- testing;
- a sanitized Google Cloud deployment blueprint.
This public repository is designed as a portfolio-grade implementation. It demonstrates the technical and product direction of the system while intentionally excluding production-only assets such as real cloud project identifiers, secrets, billing configuration, private datasets, authentication logic, premium access control, analytics pipelines, and production infrastructure state.
The target product is an emotional-semantic recommendation API and user-facing analysis experience.
Given a poem, lyric-like text, or short user-provided fragment, the system can return:
- predicted emotional or thematic tags;
- semantically similar poems or texts;
- dominant topic information;
- cluster assignment;
- explainable similarity signals;
- optional visualization metadata.
The long-term vision is to support mood-aware discovery for poetic and lyrical content.
For copyrighted lyrics, production usage should rely only on licensed, public-domain, metadata-based, short-excerpt, or user-provided content. Public demos should avoid storing or redistributing full copyrighted lyrics unless properly licensed.
VersoVector starts with poetry because poetry provides dense emotional and symbolic language.
The long-term vision is to evolve the system into an emotional-semantic recommendation engine for poetic and lyrical content.
The goal is not only to recommend similar texts, but to explain why they feel emotionally close:
- shared moods;
- symbolic intensity;
- semantic neighborhoods;
- dominant topics;
- affective tags;
- interpretability signals.
A future version could support mood-aware discovery for music, lyrics, journaling, education, literary exploration, libraries, editorial platforms, ecommerce catalog enrichment, and creative recommendation systems.
The public repository includes a sanitized Google Cloud deployment blueprint.
This blueprint shows how VersoVector could be deployed as a containerized emotional-semantic recommendation system using Cloud Run, Artifact Registry, Terraform, Docker services, and optional MLflow or remote artifact storage.
The diagram separates the public portfolio architecture from the private production-only layer.
Public repository scope:
architecture
demo frontend
FastAPI backend
Docker services
tests
documentation
sanitized Terraform blueprint
Private production scope:
real GCP projects
real IAM bindings
secrets
billing
authentication
premium features
analytics
production monitoring
private datasets
deployment state
The Terraform blueprint lives under:
infra/gcp-cloud-run-blueprint/
It is intentionally generic and does not include real production values.
The current analytical pipeline is organized as six reproducible notebooks:
| Notebook | Purpose |
|---|---|
01_cleaning_pipeline.ipynb |
Cleans raw datasets, normalizes metadata, creates poem_id, and generates poems_processed.csv. |
02_feature_pipeline.ipynb |
Fits the shared feature pipeline on the reference corpus and transforms external poems. |
03_embeddings_supervised.ipynb |
Trains the supervised multilabel tag prediction model. |
04_embeddings_unsupervised.ipynb |
Generates unsupervised artifacts: similarity, topics, clustering, and 2D projections. |
05_supervised_unsupervised_integration.ipynb |
Integrates supervised predictions with unsupervised outputs. |
06_visualizations.ipynb |
Generates final visualizations from integration artifacts. |
The project documentation is split by layer:
docs/model_topology.md: conceptual model topology, modeling rationale, and feature flow.notebook/README.md: notebook-first analytical guide for the exploratory pipeline.src/versovector/README.md: Python package guide for training, inference, and API serving.apps/README.md: application-layer documentation for user-facing apps.apps/frontend/README.md: Gradio frontend documentation and local execution guide.services/README.md: local services overview and Docker Compose guide.services/api/README.md: FastAPI service packaging and local Docker execution.services/frontend/README.md: frontend service packaging and Docker execution.tests/README.md: unit and integration testing strategy.infra/gcp-cloud-run-blueprint/README.md: sanitized Cloud Run deployment blueprint.
VersoVector/
├── apps/
│ └── frontend/ # Gradio user-facing demo application
├── artifacts/ # Generated artifacts, mostly ignored by Git
├── configs/ # Project and model configuration
├── data/ # Raw and processed local datasets
├── docs/ # Technical documentation and model topology
├── figs/ # Figures used by notebooks and README
├── infra/
│ └── gcp-cloud-run-blueprint/ # Sanitized Terraform blueprint for public portfolio use
├── modules/ # Reusable analytical project modules
│ ├── classification/
│ ├── clustering/
│ ├── evaluation/
│ ├── features/
│ ├── integration/
│ ├── io/
│ └── preprocessing/
├── notebook/ # Analytical notebooks 01–06
├── services/
│ ├── api/ # FastAPI Docker/service packaging
│ ├── frontend/ # Gradio Docker/service packaging
│ └── compose.yaml # Local API + frontend orchestration
├── src/
│ └── versovector/
│ ├── training/ # Scripted training pipeline
│ ├── inference/ # Model bundle loading and analysis logic
│ └── api/ # FastAPI serving layer
├── tests/
│ ├── unit/ # Unit tests
│ └── integration/ # Integration tests
├── utils/ # Shared constants and utility helpers
├── requirements.txt
├── requirements-dev.txt
├── requirements-test.txt
├── requirements-mlops.txt
├── requirements-api.txt
├── requirements-frontend.txt
├── requirements-umap.txt
├── pyproject.toml
├── tox.ini
└── README.md
VersoVector combines supervised and unsupervised NLP methods.
The supervised branch predicts multilabel poetic tags using a sparse normalized feature representation and a multilabel classifier.
The strongest current candidate is:
OneVsRestClassifier
└── StackingClassifier
├── ComplementNB
├── MultinomialNB
└── LogisticRegression final estimator
The sparse pipeline avoids converting the full feature matrix to a dense representation, which significantly improves runtime and memory usage.
The unsupervised branch generates:
- cosine similarity recommendations;
- nearest-neighbor search;
- LDA topic modeling;
- KMeans and GMM cluster assignments;
- exploratory Agglomerative and DBSCAN clusters;
- 2D UMAP or t-SNE projections.
For clustering, the pipeline avoids densifying the full sparse matrix. Dimensionality reduction is applied first, then clustering is performed on the reduced dense representation.
The project separates the corpus into two roles:
| Role | Meaning |
|---|---|
reference |
Poetry Foundation corpus used to fit feature and model spaces. |
external |
External poets projected into the learned space, such as César Vallejo or future poets. |
This avoids fitting the vector space directly on external poems and makes extrapolation clearer.
Every poem receives a unique technical identifier:
poem_id
This avoids joining outputs by non-unique fields such as title, poet, source, or corpus role.
The main feature pipeline is sparse and normalized:
build_feature_pipeline(
input_is_processed=True,
to_dense=False,
normalize=True,
)Dense matrices are only created after dimensionality reduction when required by downstream algorithms.
The serving layer does not retrain models.
It consumes a generated model bundle:
artifacts/model_bundle/
This keeps training, packaging, inference, API serving, and frontend usage clearly separated.
The notebook pipeline is the analytical foundation. The MLOps workflow turns that foundation into reproducible scripts and deployable artifacts.
Notebook logic is exposed through reproducible scripts:
src/versovector/training/
├── build_dataset.py
├── train_features.py
├── train_supervised.py
├── train_unsupervised.py
├── register_model.py
└── mlflow_utils.py
The scripts are mapped to the notebook pipeline:
| Script | Notebook equivalent | Purpose |
|---|---|---|
build_dataset.py |
01_cleaning_pipeline.ipynb |
Builds the processed corpus. |
train_features.py |
02_feature_pipeline.ipynb |
Fits and serializes the shared feature pipeline. |
train_supervised.py |
03_embeddings_supervised.ipynb |
Trains the supervised multilabel classifier. |
train_unsupervised.py |
04_embeddings_unsupervised.ipynb |
Generates unsupervised artifacts. |
register_model.py |
Model packaging step | Builds artifacts/model_bundle/. |
MLflow is used to track:
- parameters;
- metrics;
- model artifacts;
- feature configuration;
- dataset versions;
- model versions.
MLflow support is optional and controlled by:
[mlflow]
enabled = trueIf MLflow is disabled, the scripts can run without MLflow installed.
The trained system is packaged as a model bundle:
artifacts/model_bundle/
├── model_config.toml
├── model_metadata.json
├── feature_pipeline.joblib
├── supervised_classifier.joblib
├── multilabel_binarizer.joblib
├── nearest_neighbors.joblib
├── reference_metadata.csv
├── lda_model.joblib
├── lda_count_vectorizer.joblib
├── dimensionality_reducer.joblib
├── kmeans_model.joblib
├── gmm_model.joblib
├── lda_topics.csv
├── supervised_metrics.csv
├── unsupervised_metadata.json
└── unsupervised_results.csv
The first deployable product is a single PoemAnalyzer bundle that can return tags, similar poems, topics, and clusters.
The inference layer loads a generated model bundle and exposes reusable Python components:
src/versovector/inference/
├── artifact_loader.py
├── poem_analyzer.py
├── schemas.py
├── similarity_search.py
├── tag_predictor.py
└── topic_clusterer.py
Example local usage:
from versovector.inference import PoemAnalyzer
analyzer = PoemAnalyzer.from_bundle("artifacts/model_bundle")
result = analyzer.analyze_dict(
title="test poem",
poet="anonymous",
poem="I walk through the rain carrying a memory of light.",
top_k_tags=5,
top_n_similar=5,
)
print(result)The API layer exposes the inference package through FastAPI:
src/versovector/api/
├── dependencies.py
├── main.py
├── schemas.py
└── settings.py
Current endpoints:
GET /
GET /health
GET /ready
GET /v1/model-info
POST /v1/analyze
POST /v1/predict-tags
POST /v1/similar
Example request:
{
"title": "Untitled",
"poet": "anonymous",
"poem": "I walk through the rain carrying a memory...",
"user_tags": ["memory", "sadness"],
"top_k_tags": 5,
"top_n_similar": 5
}Example response shape:
{
"predicted_tags": [
{"tag": "sadness", "score": 0.82},
{"tag": "memory", "score": 0.71}
],
"similar_poems": [
{
"title": "black herald",
"poet": "cesar vallejo",
"score": 0.76
}
],
"topic": {
"topic_id": 3,
"terms": "death, night, heart, god, pain"
},
"cluster": {
"kmeans": 2,
"gmm": 1
}
}The project includes a separate testing layer for unit and integration validation.
The test suite is intended to validate:
- preprocessing utilities;
- artifact loading helpers;
- inference schemas;
- tag prediction logic;
- similarity search formatting;
- model bundle loading;
PoemAnalyzerend-to-end execution;- FastAPI endpoints through
TestClient.
VersoVector includes a packaging and local services foundation.
This layer separates Python package code, user-facing applications, and service-level deployment assets:
apps/
└── frontend/
services/
├── api/
├── frontend/
└── compose.yaml
The API and frontend can run locally through Docker Compose.
The first frontend is implemented with Gradio because it is well suited for interactive AI demos.
The frontend allows users to:
- submit a poem or lyric-like fragment;
- select suggested tags;
- request emotional or thematic tag prediction;
- retrieve semantically similar poems;
- inspect topic and cluster information;
- access a secondary API and developer guide.
The frontend does not load model artifacts directly. It communicates with the FastAPI backend.
The public repository includes a sanitized Terraform blueprint under:
infra/gcp-cloud-run-blueprint/
It models the following public portfolio architecture:
Users
↓
Gradio Frontend
Cloud Run
↓
FastAPI API
Cloud Run
↓
PoemAnalyzer
↓
Model Bundle / Reference Metadata
Supporting infrastructure:
Artifact Registry
Cloud Run services
runtime service accounts
public frontend access
private API access from frontend
optional remote model artifact store
The blueprint is not a production deployment.
It intentionally avoids:
real project IDs
real service account keys
real backend state
real IAM policies
real secrets
real billing setup
real domains
private datasets
premium access logic
production monitoring configuration
A production implementation should live in a private repository, for example:
VersoVector-Platform
That private repository would contain real environments such as:
infra/
├── environments/
│ ├── dev/
│ ├── staging/
│ └── prod/
├── backend.tf
├── terraform.tfvars
├── iam/
├── secrets/
├── cloud-run/
├── artifact-registry/
└── monitoring/
GitHub Actions will be used to automate:
- linting;
- unit tests;
- integration tests;
- package builds;
- Docker image builds;
- model validation;
- optional deployment workflows.
The public repository may include CI/CD examples or non-sensitive workflows. Production deployment automation should remain private when it involves real cloud projects, service accounts, secrets, billing, private environments, or production access rules.
VersoVector includes an initial packaging and local services foundation.
This layer prepares the project for containerized execution by separating Python package code, user-facing applications, and service-level deployment assets.
apps/
└── frontend/
├── README.md
├── app.py
├── client.py
├── tag_catalog.py
├── ui_formatters.py
└── assets/
services/
├── README.md
├── api/
│ ├── Dockerfile
│ ├── README.md
│ └── .dockerignore
├── frontend/
│ ├── Dockerfile
│ ├── README.md
│ └── .dockerignore
└── compose.yaml
The project uses pyproject.toml as the package metadata and build configuration source.
A local wheel can be generated with:
python -m buildThe generated files are written to:
dist/
├── versovector-<version>-py3-none-any.whl
└── versovector-<version>.tar.gz
The dist/ directory is a generated build output and should not be committed to Git.
The wheel packages project code, including:
src/versovector/
modules/
utils/
It does not include generated model artifacts such as:
artifacts/model_bundle/
*.joblib
mlruns/
mlflow.db
The services layer currently defines two local services:
api: FastAPI backend powered byPoemAnalyzer.frontend: Gradio demo app that calls the API over HTTP.
Run both services locally with Docker Compose:
docker compose -f services/compose.yaml up --buildLocal URLs:
API:
http://localhost:8001
API docs:
http://localhost:8001/docs
Frontend:
http://localhost:7860
In this foundation stage, the API image does not bake model artifacts into the container.
Instead, the local model bundle is mounted as a read-only volume:
artifacts/model_bundle:/app/artifacts/model_bundle:ro
This keeps code packaging and model artifact management separated.
For future Cloud Run deployment, the model bundle should be retrieved from a controlled artifact location such as Google Cloud Storage, Artifact Registry, or an MLflow artifact store.
The infra/gcp-cloud-run-blueprint/ directory contains a sanitized Terraform blueprint for a possible Google Cloud deployment.
The blueprint includes conceptual resources for:
- Artifact Registry;
- Cloud Run API service;
- Cloud Run frontend service;
- runtime service accounts;
- public frontend invoker access;
- private API access from the frontend service account;
- optional remote model artifact store configuration.
This layer is intended to show how the system could be deployed without exposing real production infrastructure.
Basic Terraform workflow:
cd infra/gcp-cloud-run-blueprint
cp terraform.tfvars.example terraform.tfvars
terraform init
terraform fmt -recursive
terraform validate
terraform plan -var-file=terraform.tfvarsDo not commit terraform.tfvars, .terraform/, or *.tfstate files.
Create a virtual environment:
python3.10 -m venv .venv
source .venv/bin/activateInstall core dependencies:
pip install --upgrade pip
pip install -r requirements.txtInstall development dependencies:
pip install -r requirements-dev.txtInstall test dependencies:
pip install -r requirements-test.txtInstall MLOps dependencies:
pip install -r requirements-mlops.txtInstall API dependencies:
pip install -r requirements-api.txtInstall frontend dependencies:
pip install -r requirements-frontend.txtInstall optional UMAP dependencies:
pip install -r requirements-umap.txtDownload the spaCy model:
python -m spacy download en_core_web_lgStart Jupyter:
jupyter notebookRun notebooks in order:
01_cleaning_pipeline.ipynb
02_feature_pipeline.ipynb
03_embeddings_supervised.ipynb
04_embeddings_unsupervised.ipynb
05_supervised_unsupervised_integration.ipynb
06_visualizations.ipynb
Generated files are stored under:
artifacts/
figs/
data/
Large local artifacts are not intended to be committed to Git.
Run from the repository root:
PYTHONPATH=src:. python -m versovector.training.build_dataset \
--config configs/model_config.toml && \
PYTHONPATH=src:. python -m versovector.training.train_features \
--config configs/model_config.toml && \
PYTHONPATH=src:. python -m versovector.training.train_supervised \
--config configs/model_config.toml && \
PYTHONPATH=src:. python -m versovector.training.train_unsupervised \
--config configs/model_config.toml && \
PYTHONPATH=src:. python -m versovector.training.register_model \
--config configs/model_config.tomlPYTHONPATH=src:. allows Python to resolve both:
src/versovector/
modules/
utils/
This is useful during the transition phase while reusable analytical logic still lives under modules/.
Start the API:
PYTHONPATH=src:. uvicorn versovector.api.main:app \
--host 0.0.0.0 \
--port 8001 \
--reloadHealth check:
curl http://localhost:8001/health | jq .Readiness check:
curl http://localhost:8001/ready | jq .Model information:
curl http://localhost:8001/v1/model-info | jq .Full analysis:
curl -X POST http://localhost:8001/v1/analyze \
-H "Content-Type: application/json" \
-d '{
"title": "test poem",
"poet": "anonymous",
"poem": "I walk through the rain carrying a memory of light.",
"top_k_tags": 5,
"top_n_similar": 5
}' | jq .Tag prediction:
curl -X POST http://localhost:8001/v1/predict-tags \
-H "Content-Type: application/json" \
-d '{
"title": "test poem",
"poet": "anonymous",
"poem": "I walk through the rain carrying a memory of light.",
"top_k_tags": 5
}' | jq .Similarity search:
curl -X POST http://localhost:8001/v1/similar \
-H "Content-Type: application/json" \
-d '{
"title": "test poem",
"poet": "anonymous",
"poem": "I walk through the rain carrying a memory of light.",
"top_n_similar": 5
}' | jq .Start the API first:
PYTHONPATH=src:. uvicorn versovector.api.main:app \
--host 0.0.0.0 \
--port 8001 \
--reloadWarm up the model bundle:
curl http://localhost:8001/ready | jq .Run the frontend:
VERSOVECTOR_API_BASE_URL=http://localhost:8001 \
VERSOVECTOR_API_TIMEOUT_SECONDS=300 \
PORT=7860 \
python apps/frontend/app.pyOpen:
http://localhost:7860
Run both API and frontend:
docker compose -f services/compose.yaml up --buildLocal URLs:
API:
http://localhost:8001
API docs:
http://localhost:8001/docs
Frontend:
http://localhost:7860
Testing is a separate quality layer.
Recommended structure:
tests/
├── unit/
│ ├── test_preprocessing.py
│ ├── test_artifact_store.py
│ ├── test_inference_schemas.py
│ ├── test_similarity_search.py
│ └── test_tag_predictor.py
└── integration/
├── test_model_bundle_loading.py
├── test_poem_analyzer_end_to_end.py
└── test_api_endpoints.py
Unit tests validate isolated Python components without requiring a full model bundle.
Examples:
- text cleaning and tag parsing;
- artifact path helpers;
- schema serialization;
- similarity result formatting;
- tag ranking logic.
Integration tests validate that multiple layers work together.
Examples:
- loading a minimal test model bundle;
- running
PoemAnalyzer.analyze_dict; - calling FastAPI endpoints with
TestClient; - validating API response shapes.
The services/ directory is for deployment packaging and service-level assets.
The tests/integration/ directory is for validating interactions between components.
They are related, but not the same thing:
services/
How the API or frontend is packaged and deployed.
tests/integration/
Whether components work together correctly.
Run unit tests:
PYTHONPATH=src:. pytest tests/unitRun integration tests:
PYTHONPATH=src:. pytest tests/integrationRun coverage:
PYTHONPATH=src:. pytest --cov=src --cov=modules tests/Run through tox:
tox -e unit
tox -e integration
tox -e coverage
tox -e buildThe project separates dependencies by purpose:
requirements.txt
Core runtime dependencies.
requirements-dev.txt
Development dependencies: notebooks, linting, formatting, packaging, and local development tools.
requirements-test.txt
Test-only dependencies for unit and integration tests.
requirements-mlops.txt
MLflow and model tracking dependencies.
requirements-api.txt
FastAPI serving dependencies.
requirements-frontend.txt
Gradio frontend/demo dependencies.
requirements-umap.txt
Optional UMAP dependencies.
The repository does not version large generated artifacts such as:
*.joblib
*.pkl
*.pickle
*.npy
*.npz
Ignored generated paths include:
artifacts/features/*
artifacts/supervised/*
artifacts/unsupervised/*
artifacts/integration/*
artifacts/model_bundle/*
mlruns/
mlflow.db
dist/
build/
*.egg-info/
These artifacts are regenerated locally from notebooks or training scripts.
A production MLOps version should store model artifacts in a dedicated artifact store, such as MLflow artifacts, Google Cloud Storage, or another controlled registry.
The public repository may include sanitized infrastructure blueprints.
It should not include:
real project IDs
real terraform.tfvars files
Terraform state
service account keys
secrets
private IAM bindings
billing configuration
production domains
premium access rules
customer data
private datasets
The production infrastructure should live in a private repository or private environment-specific configuration.
The following figures are intentionally kept in the repository:
figs/vallejo_tfidf_vectors.png
figs/poemas_2d_umap_clustering_kmeans.png
figs/VersoVector-GCP-Architecture.png
figs/vallejo_tfidf_vectors.png is used in the project documentation.
figs/poemas_2d_umap_clustering_kmeans.png is kept as a visual reference for the organic UMAP structure observed during exploration.
figs/VersoVector-GCP-Architecture.png documents the sanitized public GCP blueprint for the project.
Additional figures generated by 06_visualizations.ipynb may include:
figs/poemas_2d_integracion.png
figs/vallejo_projection.png
figs/top_tags_by_cluster.png
figs/cluster_source_matrix.png
figs/cluster_topic_matrix.png
Completed:
- modular notebook pipeline;
- reusable feature, classification, clustering, evaluation, I/O, preprocessing, and integration modules;
- supervised multilabel classifier;
- unsupervised similarity, topic, and clustering outputs;
- integration artifacts;
- final visualizations;
- script-based training layer;
- optional MLflow helper utilities;
- model bundle packaging;
- inference package through
PoemAnalyzer; - FastAPI serving layer;
- unit and integration testing foundation;
- Python package build foundation;
- Docker-based local services foundation;
- Gradio frontend product experience foundation;
- sanitized Cloud Run deployment blueprint.
Next:
- refine GitHub Actions for linting, tests, package build, and image build;
- define a private production infrastructure repository;
- move real Terraform environments and secrets to a private repo;
- prepare a controlled model artifact strategy using GCS, MLflow artifacts, or Artifact Registry;
- add production observability and analytics outside the public portfolio repo;
- evaluate authentication, billing, and premium features in a private product track.
The base .gitignore was generated from gitignore.io with the filters python, macos, and windows.
Additional project-specific ignore rules exclude generated datasets, model artifacts, MLflow local files, Terraform local state, local package build outputs, and large binary files.
- Hubert Ronald - Initial Work - HubertRonald
- See also the list of contributors who participated in this project.
The source code in this repository is distributed under the MIT License. See the LICENSE file for more details.
The repository may reference public-domain poems, open datasets, translations, or third-party textual materials for experimentation. These materials may be subject to their own licenses, terms of use, or copyright restrictions and are not automatically covered by the MIT License of this codebase.
For production, commercial deployment, hosted inference, or redistribution of trained model artifacts, dataset and text usage rights should be reviewed separately. A production version of VersoVector should rely only on public-domain, properly licensed, or otherwise cleared textual sources.
