Ashitha P
This project performs text extraction and natural language processing (NLP) on articles obtained from the Blackcoffer Insights website.
The program reads saved HTML article files, extracts the title and article content, and performs sentiment analysis and readability analysis using a predefined Master Dictionary and StopWords list.
The final output calculates multiple text analysis variables such as:
- Positive Score
- Negative Score
- Polarity Score
- Subjectivity Score
- Average Sentence Length
- Percentage of Complex Words
- Fog Index
- Complex Word Count
- Word Count
- Syllables Per Word
- Personal Pronouns
- Average Word Length
The results are saved in the required output format specified in the assignment.
The solution follows these main steps:
- Reads article URLs and IDs from Input.xlsx
- Reads the output structure from Output Data Structure.xlsx
Stopwords are loaded from:
- Custom stopwords provided in the StopWords folder
- NLTK English stopwords
Both are combined to create a complete stopword list.
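The combining step above can be sketched as follows. The folder path, file pattern, and the "word | comment" entry format are assumptions based on the project layout; NLTK's English stopwords are imported lazily so the custom loader also works on its own.

```python
import glob

def load_custom_stopwords(folder="StopWords"):
    """Read every *.txt file in the StopWords folder into one lowercase set."""
    words = set()
    for path in glob.glob(f"{folder}/*.txt"):
        with open(path, encoding="latin-1") as fh:
            for line in fh:
                # Some files use "WORD | comment" entries; keep the word only.
                token = line.split("|")[0].strip().lower()
                if token:
                    words.add(token)
    return words

def combined_stopwords(folder="StopWords"):
    """Union of the custom lists and NLTK's English stopwords."""
    from nltk.corpus import stopwords
    return load_custom_stopwords(folder) | set(stopwords.words("english"))
```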
Positive and negative word lists are loaded from:
MasterDictionary/
├── positive-words.txt
└── negative-words.txt
Stopwords are removed from these lists to improve sentiment accuracy.
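A minimal sketch of this loading step, assuming the file names shown in the tree above and that `stop_words` is the combined stopword set from the previous step (lines starting with ";" are treated as header comments, a convention of the common opinion-lexicon files):

```python
def load_master_dictionary(folder="MasterDictionary", stop_words=frozenset()):
    """Load positive/negative word sets, minus any stopwords."""
    def read_words(path):
        words = set()
        with open(path, encoding="latin-1") as fh:
            for line in fh:
                w = line.strip().lower()
                if w and not w.startswith(";"):  # skip blank and comment lines
                    words.add(w)
        return words

    positive = read_words(f"{folder}/positive-words.txt") - set(stop_words)
    negative = read_words(f"{folder}/negative-words.txt") - set(stop_words)
    return positive, negative
```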
The script extracts:
- Article Title
- Main Content
From saved HTML files located in:
Saved_HTML/
The parser searches multiple possible containers such as:
- td-post-content
- article
- post-body
- content
This ensures robust extraction even if page structures vary.
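One way to implement that fallback search with BeautifulSoup; the tag/class pairs mirror the candidates listed above, and the exact selectors are assumptions about the saved pages:

```python
from bs4 import BeautifulSoup

# Candidate containers, tried in order until one matches.
CANDIDATE_SELECTORS = [
    ("div", {"class_": "td-post-content"}),
    ("article", {}),
    ("div", {"class_": "post-body"}),
    ("div", {"class_": "content"}),
]

def extract_article(html):
    """Return (title, body text) from raw HTML, or empty strings if absent."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    for tag, attrs in CANDIDATE_SELECTORS:
        node = soup.find(tag, **attrs)
        if node:
            return title, node.get_text(separator=" ", strip=True)
    return title, ""
```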
The article text is processed by:
- Converting to lowercase
- Removing punctuation
- Tokenizing words using NLTK
- Removing stopwords
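The cleaning pipeline above, condensed into one function. The script itself tokenizes with NLTK's word_tokenize; a regex split stands in here so the sketch runs without the punkt model, and `stop_words` is assumed to be the combined stopword set built earlier:

```python
import re

def clean_tokens(text, stop_words=frozenset()):
    """Lowercase, strip punctuation/digits, split into words, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation and digits
    return [t for t in text.split() if t not in stop_words]
```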
Using the Master Dictionary, the program calculates:
- Positive Score
- Negative Score
- Polarity Score
- Subjectivity Score
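These four scores follow the formulas given in the assignment template (the 0.000001 term guards against division by zero); `tokens` are the cleaned tokens and `positive`/`negative` the Master Dictionary sets:

```python
def sentiment_scores(tokens, positive, negative):
    """Positive, Negative, Polarity, and Subjectivity scores for one article."""
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    polarity = (pos - neg) / ((pos + neg) + 0.000001)
    subjectivity = (pos + neg) / (len(tokens) + 0.000001)
    return pos, neg, polarity, subjectivity
```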
The following readability metrics are computed:
- Average Sentence Length
- Percentage of Complex Words
- Fog Index
- Syllables Per Word
- Complex Word Count
- Word Count
- Average Word Length
- Personal Pronouns
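A sketch of the core readability computations. The syllable counter uses the simple vowel-group heuristic with an "es"/"ed" suffix exception, which is an assumption about the exact rule the script applies; complex words are those with more than two syllables:

```python
import re

def count_syllables(word):
    """Vowel-group heuristic; words ending in 'es'/'ed' drop that suffix."""
    word = word.lower()
    if word.endswith(("es", "ed")):
        word = word[:-2]
    return max(1, len(re.findall(r"[aeiou]+", word)))

def readability(words, num_sentences):
    """Average Sentence Length, % Complex Words, Fog Index, Complex Word Count."""
    complex_count = sum(1 for w in words if count_syllables(w) > 2)
    avg_sentence_len = len(words) / max(1, num_sentences)
    pct_complex = complex_count / max(1, len(words))
    fog_index = 0.4 * (avg_sentence_len + pct_complex)
    return avg_sentence_len, pct_complex, fog_index, complex_count
```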
The results are saved in:
Output.xlsx
The output matches the format specified in Output Data Structure.xlsx.
Any missing or unreadable HTML files are logged in:
missing_files.txt
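A hypothetical sketch of this output step with pandas; the row dictionaries would use the column names from Output Data Structure.xlsx, abbreviated here:

```python
import pandas as pd

def save_results(rows, missing, out_path="Output.xlsx", log_path="missing_files.txt"):
    """Write one row per article to Excel and log missing/unreadable files."""
    pd.DataFrame(rows).to_excel(out_path, index=False)  # requires openpyxl
    with open(log_path, "w") as fh:
        fh.write("\n".join(missing))
```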
Blackcoffer_Assignment
│
├── Input.xlsx
├── Output Data Structure.xlsx
├── Output.xlsx
│
├── Saved_HTML
│ ├── 123.html
│ ├── 124.html
│ └── ...
│
├── StopWords
│ ├── StopWords_Auditor.txt
│ ├── StopWords_Currencies.txt
│ └── ...
│
├── MasterDictionary
│ ├── positive-words.txt
│ └── negative-words.txt
│
├── missing_files.txt
├── analysis.py
└── README.md
- Python
- Pandas
- BeautifulSoup
- NLTK
- Regular Expressions
Install dependencies using:
pip install pandas beautifulsoup4 nltk openpyxl

NLTK datasets are downloaded automatically by the script:
punkt
stopwords
- Clone the repository
git clone https://github.com/your-username/blackcoffer-nlp-analysis.git
- Navigate to the project folder
cd blackcoffer-nlp-analysis
- Run the script
python analysis.py
- Output will be generated as:
Output.xlsx
- The script processes locally saved HTML files instead of live scraping.
- This approach avoids failures caused by SSL issues or website access restrictions.