SentiSum: Take-Home Assignment - Multilabel Document Categorization

Customer reviews contain numerous insights that we call "topics." A topic is a concise description of a meaning or theme expressed in a piece of text. For example:

“One tyre went missing, so there was a delay to get the two tyres fitted. The garage I dealt with were fantastic.”

Topics: (incorrect tyres) (garage service) (wait time)
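Because a single review can carry several topics at once, the target for each document is a set of labels rather than one class. A minimal sketch of the usual encoding, a 0/1 indicator vector with one slot per topic, is shown below. The first three topic names come from the example above; "value for money" and the helper names are hypothetical and only illustrate a topic that does not apply to this review.

```python
# Topic vocabulary: the first three come from the example review above;
# "value for money" is a hypothetical extra topic added for illustration.
TOPICS = ["incorrect tyres", "garage service", "wait time", "value for money"]


def to_indicator(labels, topics=TOPICS):
    """Encode a set of topic labels as a 0/1 vector, one slot per known topic."""
    unknown = set(labels) - set(topics)
    if unknown:
        raise ValueError(f"unknown topics: {sorted(unknown)}")
    return [1 if topic in labels else 0 for topic in topics]


def from_indicator(vector, topics=TOPICS):
    """Decode a 0/1 vector back into the set of topic labels it represents."""
    return {topic for topic, flag in zip(topics, vector) if flag}


review_labels = {"incorrect tyres", "garage service", "wait time"}
review_vector = to_indicator(review_labels)
```

Each position in the vector is an independent yes/no decision, which is what distinguishes multi-label classification from ordinary multi-class classification.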

The core of this task is to train a supervised, multi-label topic classifier starting from a set of raw, unlabelled documents. To guide this process, we provide a list of pre-defined target topics and their descriptions.

A Note on the Right Approach

Given the current hype, it might be tempting to use a large, general-purpose LLM (like GPT-4) to classify the documents directly. Please avoid this. The goal of this assignment is to demonstrate your ability to build a bespoke NLP pipeline from scratch, using unsupervised techniques to create training data for a supervised model, which is a powerful real-world skill.

The dataset is provided via a Google Drive link: Dataset Link. It contains:

A CSV file with one document (customer review) per entry.
A TXT file containing the set of "Provided Topics" and their descriptions, which you will use as your target labels.

Sub-task 1: Unsupervised Topic Modeling

"Topic modeling is an unsupervised machine learning technique that's capable of scanning a set of documents, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterize a set of documents."

Your first task is to produce topic clusters from the raw data. The more granular these clusters are, and the closer they are to the "Provided Topics," the better. You are free to refine the clusters as much as you like so that they align more closely with the set of Provided Topics.
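One way to check how close a cluster is to a Provided Topic is to compare the cluster's top terms with the topic's description. The sketch below is only an illustration under stated assumptions: it presumes some topic model has already produced top terms per cluster, and the topic descriptions, cluster terms, function names, and the `min_score` threshold are all hypothetical rather than taken from the dataset. It uses plain bag-of-words cosine similarity; an embedding-based similarity could be swapped in.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align_clusters(cluster_terms, topic_descriptions, min_score=0.2):
    """Map each cluster id to its best-matching provided topic, or None."""
    topic_vecs = {
        name: Counter(tokens(name + " " + desc))
        for name, desc in topic_descriptions.items()
    }
    mapping = {}
    for cluster_id, terms in cluster_terms.items():
        vec = Counter(tokens(" ".join(terms)))
        best, score = max(
            ((name, cosine(vec, tv)) for name, tv in topic_vecs.items()),
            key=lambda pair: pair[1],
        )
        # Clusters that resemble no provided topic stay unmapped.
        mapping[cluster_id] = best if score >= min_score else None
    return mapping


# Hypothetical descriptions and cluster top terms, for illustration only.
topic_descriptions = {
    "wait time": "delay or long wait before the tyres were fitted",
    "garage service": "quality of service received at the garage",
}
cluster_terms = {
    0: ["delay", "wait", "hours", "late"],
    1: ["garage", "staff", "service", "helpful"],
    2: ["price", "cheap", "value"],
}
cluster_to_topic = align_clusters(cluster_terms, topic_descriptions)
```

Here clusters 0 and 1 map to "wait time" and "garage service", while cluster 2 shares no vocabulary with either description and is left unmapped.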

Sub-task 2: Learning a Supervised Multi-Topic Classifier

"Topic classification is a 'supervised' machine learning technique, one that needs training before being able to automatically analyze texts."

Based on the Relevant Topic Clusters identified in Sub-task 1, you will programmatically annotate the documents. Using this weakly-labeled dataset, you need to train a Supervised Classifier that can label any new document with the set of topics that have been identified.

PLEASE NOTE: Your classifier will only be able to predict topics for which you were able to generate labels in Sub-task 1. No manual annotation is required.
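The annotation and training steps of Sub-task 2 can be sketched roughly as follows. This is a toy illustration, not a recommended model: it assumes each document has already been assigned to one or more clusters and that a cluster-to-topic mapping exists, and it swaps in a simple one-vs-rest perceptron over bag-of-words features in place of whatever classifier you choose. The documents, cluster ids, and function names are hypothetical.

```python
import re
from collections import defaultdict


def tokens(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def weak_labels(doc_clusters, cluster_to_topic):
    """Turn per-document cluster ids into topic sets, dropping unmapped clusters."""
    return [
        {cluster_to_topic[c] for c in clusters if cluster_to_topic.get(c)}
        for clusters in doc_clusters
    ]


def train_one_vs_rest(docs, label_sets, epochs=20):
    """Train one binary perceptron per topic on bag-of-words features."""
    topics = sorted(set().union(*label_sets))
    models = {}
    for topic in topics:
        weights, bias = defaultdict(float), 0.0
        for _ in range(epochs):
            for doc, labels in zip(docs, label_sets):
                target = 1 if topic in labels else -1
                score = bias + sum(weights[t] for t in tokens(doc))
                if target * score <= 0:  # misclassified: nudge towards the target
                    for t in tokens(doc):
                        weights[t] += target
                    bias += target
        models[topic] = (weights, bias)
    return models


def predict(models, doc):
    """Return every topic whose binary model scores the document above zero."""
    return {
        topic
        for topic, (weights, bias) in models.items()
        if bias + sum(weights.get(t, 0.0) for t in tokens(doc)) > 0
    }


# Hypothetical documents, cluster assignments, and mapping, for illustration only.
docs = [
    "long delay waiting for the fitting",
    "the garage staff were fantastic",
    "delay at the garage but friendly staff",
    "cheap price",
]
doc_clusters = [[0], [1], [0, 1], [2]]
cluster_to_topic = {0: "wait time", 1: "garage service", 2: None}

label_sets = weak_labels(doc_clusters, cluster_to_topic)
models = train_one_vs_rest(docs, label_sets)
```

Note that cluster 2 has no mapped topic, so the fourth document receives an empty label set and the trained models can only ever predict "wait time" or "garage service", which mirrors the note above.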

Your submission will be evaluated on the following:

Data Analysis & EDA: Insights derived from the dataset.
Approach: The intuition (linguistic and statistical) and technical motivation behind your chosen pipeline, feature engineering, and preprocessing steps.
Code Quality: Adherence to good coding practices, modularity, and error handling.
Results: The evaluation methods used for your final classifier and an analysis of its time and complexity.
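For the Results criterion, multi-label classifiers are commonly scored with micro- and macro-averaged F1 and Hamming loss. The sketch below computes these from label sets in plain Python; the function name and the sample labels are hypothetical. Keep in mind that scores computed against weak labels measure agreement with the labelling heuristic, not necessarily true accuracy.

```python
def multilabel_scores(true_sets, pred_sets, topics):
    """Compute micro F1, macro F1, and Hamming loss for multi-label predictions."""
    per_topic_f1 = []
    tp_all = fp_all = fn_all = 0
    for topic in topics:
        pairs = list(zip(true_sets, pred_sets))
        tp = sum(topic in t and topic in p for t, p in pairs)
        fp = sum(topic not in t and topic in p for t, p in pairs)
        fn = sum(topic in t and topic not in p for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A topic with no true or predicted instances scores 0 by convention here.
        per_topic_f1.append(2 * tp / denom if denom else 0.0)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    micro_denom = 2 * tp_all + fp_all + fn_all
    return {
        # Micro F1 pools all decisions, so frequent topics dominate.
        "micro_f1": 2 * tp_all / micro_denom if micro_denom else 0.0,
        # Macro F1 averages per-topic F1, so rare topics count equally.
        "macro_f1": sum(per_topic_f1) / len(per_topic_f1),
        # Hamming loss is the fraction of wrong (document, topic) decisions.
        "hamming_loss": (fp_all + fn_all) / (len(true_sets) * len(topics)),
    }


# Hypothetical gold and predicted label sets, for illustration only.
topics = ["wait time", "garage service"]
true_sets = [{"wait time"}, {"garage service"}, {"wait time", "garage service"}]
pred_sets = [{"wait time"}, {"wait time", "garage service"}, {"garage service"}]
scores = multilabel_scores(true_sets, pred_sets, topics)
```

Reporting both averages is useful because a large gap between micro and macro F1 usually signals that rare topics are being predicted poorly.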

Your submission should include:

All code and resources (a link to a Git repository is sufficient).
A summary and explanation of your approach, its shortcomings, and ideas for improvements. This is as important as the execution.