
Sentiment Analysis Web App
A Flask web app that classifies IMDB movie reviews with a NumPy logistic regression model and compares results against scikit-learn.
Description
Context
This is the first project in my Vibecoding Ascent challenge: a full sentiment analysis pipeline for movie reviews, from preprocessing through model training to browser-based inference.
Instead of relying only on framework abstractions, I built the core model logic myself to understand the mechanics of text classification at a first-principles level.
Problem and Objective
The goal was to classify IMDB reviews as positive or negative while comparing a custom implementation against a standard library baseline.
I wanted to answer two questions:
• Can a NumPy-only logistic regression implementation reach competitive performance?
• Which parts of the text pipeline matter most for practical sentiment quality?
Modeling Approach
The primary model is logistic regression implemented from scratch in NumPy and optimized with gradient descent.
• Loss function: binary cross-entropy
• Training iterations: 1,000
• Learning rate: 0.1
• Output: binary sentiment prediction
Building the optimizer manually made it easier to inspect training behavior and understand how parameter updates respond to sparse features.
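The training loop described above can be sketched in NumPy as follows. This is a minimal illustration of full-batch gradient descent on binary cross-entropy with the stated hyperparameters; function and variable names are mine, not taken from the repository.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Fit weights and bias by full-batch gradient descent on BCE loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)               # predicted P(positive)
        grad_w = X.T @ (p - y) / n_samples   # dL/dw for cross-entropy
        grad_b = np.mean(p - y)              # dL/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b, threshold=0.5):
    """Binary sentiment prediction: 1 = positive, 0 = negative."""
    return (sigmoid(X @ w + b) >= threshold).astype(int)
```

Because the gradient is just `X.T @ (p - y)`, it is easy to print or plot intermediate values and watch how sparse TF-IDF features drive individual weight updates.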
Feature Engineering Pipeline
Review text is transformed into TF-IDF vectors with both unigrams and bigrams, capped at 5,000 features.
• Unigrams capture broad sentiment-bearing words
• Bigrams capture short phrase context that can flip polarity
• Preprocessing/cleaning modules keep train-time and inference-time transformations consistent
This setup balances representation quality with computational efficiency.
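A featurization with these settings can be sketched with scikit-learn's `TfidfVectorizer`; the n-gram range and feature cap mirror the description above, while the sample reviews and any additional cleaning the repository performs are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Unigrams + bigrams, capped at 5,000 features, as described above.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)

reviews = [
    "a wonderful, moving film",
    "not good at all, a complete waste of time",
]
# fit_transform learns the vocabulary and returns a sparse TF-IDF matrix.
X = vectorizer.fit_transform(reviews)
print(X.shape)  # (n_reviews, n_features)
```

The key consistency point: the vectorizer is fitted once on training data, and the same fitted object is reused at inference time, so a review typed into the web app is transformed exactly like the training corpus.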
Benchmarking and Performance
To validate the custom implementation, I trained a parallel scikit-learn baseline on the same data setup and compared outcomes directly.
• Dataset: IMDB movie reviews (50,000 total)
• Custom NumPy logistic regression: approximately 85-88% validation accuracy
• scikit-learn comparison model: similar accuracy range
The comparison showed that the custom model remains competitive while being fully transparent and inspectable.
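The benchmarking idea can be illustrated on synthetic data: fit scikit-learn's `LogisticRegression` and a minimal NumPy gradient-descent model on the same split and compare held-out accuracy. The Gaussian-blob data below is a stand-in for TF-IDF features, not the IMDB setup itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Two well-separated 20-dimensional blobs as stand-in features.
X = np.vstack([rng.normal(-1, 1, (200, 20)), rng.normal(1, 1, (200, 20))])
y = np.array([0] * 200 + [1] * 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# scikit-learn baseline on the same split.
baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc_sklearn = baseline.score(X_te, y_te)

# Minimal NumPy gradient-descent model (same loss family, lr 0.1, 1,000 steps).
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(1000):
    p = 1 / (1 + np.exp(-(X_tr @ w + b)))
    w -= 0.1 * X_tr.T @ (p - y_tr) / len(y_tr)
    b -= 0.1 * np.mean(p - y_tr)
acc_numpy = np.mean(((1 / (1 + np.exp(-(X_te @ w + b)))) >= 0.5) == y_te)
print(acc_sklearn, acc_numpy)
```

Holding the data split fixed is what makes the comparison meaningful: any accuracy gap then reflects the optimizer and regularization choices rather than the data.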
Productization and Tooling
The project is wrapped in a Flask web app so users can input any review and receive an immediate sentiment prediction.
Supporting scripts handle the full workflow:
• Model training
• Feature extraction and visualization generation
• Prediction testing
This made experimentation repeatable and reduced friction between model iteration and UI-facing inference.
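The serving layer can be sketched as a small Flask app with a single prediction endpoint. The route name and the placeholder scorer below are assumptions for the sake of a self-contained example; the real app would load the fitted vectorizer and model weights at startup instead.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_sentiment(text: str) -> str:
    """Placeholder scorer standing in for the TF-IDF + logistic regression
    pipeline; keeps this sketch runnable without trained artifacts."""
    positive_cues = {"great", "wonderful", "excellent", "loved"}
    tokens = set(text.lower().split())
    return "positive" if tokens & positive_cues else "negative"

@app.route("/predict", methods=["POST"])
def predict():
    review = request.get_json(force=True).get("review", "")
    return jsonify({"sentiment": predict_sentiment(review)})

# Start locally with: flask --app <module_name> run
```

Separating the scoring function from the route keeps the same prediction code usable by both the web UI and the standalone testing scripts.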
What This Project Demonstrates
This project demonstrates practical ML engineering across data preprocessing, feature design, optimization, benchmarking, and deployment integration.
More importantly, it built strong intuition for linear models and NLP feature spaces, which provides a grounded foundation for later work with larger frameworks and deep learning systems.
Papers Read
• TF-IDF Feature Engineering for Text Classification: used TF-IDF vectorization with unigrams and bigrams to represent IMDB reviews as sparse numeric features, capped at 5,000 dimensions.
• Logistic Regression via Gradient Descent: built a binary classifier from scratch in NumPy with cross-entropy optimization (learning rate 0.1, 1,000 iterations) and compared it against scikit-learn.