Ashitha P
This project performs text extraction and natural language processing (NLP) on articles obtained from the Blackcoffer Insights website.
The program reads saved HTML article files, extracts the title and article content, and performs sentiment analysis and readability analysis using a predefined Master Dictionary and StopWords list.
The final output calculates multiple text analysis variables such as:
- Positive Score
- Negative Score
- Polarity Score
- Subjectivity Score
- Average Sentence Length
- Percentage of Complex Words
- Fog Index
- Complex Word Count
- Word Count
- Syllables Per Word
- Personal Pronouns
- Average Word Length
The results are saved in the required output format specified in the assignment.
The solution follows these main steps:
- Reads article URLs and IDs from Input.xlsx
- Reads the output structure from Output Data Structure.xlsx
Stopwords are loaded from:
- Custom stopwords provided in the StopWords folder
- NLTK English stopwords
Both are combined to create a complete stopword list.
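The combining step above can be sketched as follows. The folder path, file pattern, and the "word | comment" entry format are assumptions based on the project layout; NLTK's English stopwords are imported lazily so the custom loader also works on its own.

```python
import glob

def load_custom_stopwords(folder="StopWords"):
    """Read every *.txt file in the StopWords folder into one lowercase set."""
    words = set()
    for path in glob.glob(f"{folder}/*.txt"):
        with open(path, encoding="latin-1") as fh:
            for line in fh:
                # Some files use "WORD | comment" entries; keep the word only.
                token = line.split("|")[0].strip().lower()
                if token:
                    words.add(token)
    return words

def combined_stopwords(folder="StopWords"):
    """Union of the custom lists and NLTK's English stopwords."""
    from nltk.corpus import stopwords
    return load_custom_stopwords(folder) | set(stopwords.words("english"))
```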
Positive and negative word lists are loaded from:
MasterDictionary/
├── positive-words.txt
└── negative-words.txt
Stopwords are removed from these lists to improve sentiment accuracy.
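A minimal sketch of this loading step, assuming the file names shown in the tree above and that `stop_words` is the combined stopword set from the previous step (lines starting with ";" are treated as header comments, a convention of the common opinion-lexicon files):

```python
def load_master_dictionary(folder="MasterDictionary", stop_words=frozenset()):
    """Load positive/negative word sets, minus any stopwords."""
    def read_words(path):
        words = set()
        with open(path, encoding="latin-1") as fh:
            for line in fh:
                w = line.strip().lower()
                if w and not w.startswith(";"):  # skip blank and comment lines
                    words.add(w)
        return words

    positive = read_words(f"{folder}/positive-words.txt") - set(stop_words)
    negative = read_words(f"{folder}/negative-words.txt") - set(stop_words)
    return positive, negative
```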
The script extracts:
- Article Title
- Main Content
From saved HTML files located in:
Saved_HTML/
The parser searches multiple possible containers such as:
- td-post-content
- article
- post-body
- content
This ensures robust extraction even if page structures vary.
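One way to implement that fallback search with BeautifulSoup; the tag/class pairs mirror the candidates listed above, and the exact selectors are assumptions about the saved pages:

```python
from bs4 import BeautifulSoup

# Candidate containers, tried in order until one matches.
CANDIDATE_SELECTORS = [
    ("div", {"class_": "td-post-content"}),
    ("article", {}),
    ("div", {"class_": "post-body"}),
    ("div", {"class_": "content"}),
]

def extract_article(html):
    """Return (title, body text) from raw HTML, or empty strings if absent."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    for tag, attrs in CANDIDATE_SELECTORS:
        node = soup.find(tag, **attrs)
        if node:
            return title, node.get_text(separator=" ", strip=True)
    return title, ""
```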
The article text is processed by:
- Converting to lowercase
- Removing punctuation
- Tokenizing words using NLTK
- Removing stopwords
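The cleaning pipeline above, condensed into one function. The script itself tokenizes with NLTK's word_tokenize; a regex split stands in here so the sketch runs without the punkt model, and `stop_words` is assumed to be the combined stopword set built earlier:

```python
import re

def clean_tokens(text, stop_words=frozenset()):
    """Lowercase, strip punctuation/digits, split into words, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation and digits
    return [t for t in text.split() if t not in stop_words]
```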
Using the Master Dictionary, the program calculates:
- Positive Score
- Negative Score
- Polarity Score
- Subjectivity Score
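These four scores follow the formulas given in the assignment template (the 0.000001 term guards against division by zero); `tokens` are the cleaned tokens and `positive`/`negative` the Master Dictionary sets:

```python
def sentiment_scores(tokens, positive, negative):
    """Positive, Negative, Polarity, and Subjectivity scores for one article."""
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    polarity = (pos - neg) / ((pos + neg) + 0.000001)
    subjectivity = (pos + neg) / (len(tokens) + 0.000001)
    return pos, neg, polarity, subjectivity
```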
The following readability metrics are computed:
- Average Sentence Length
- Percentage of Complex Words
- Fog Index
- Syllables Per Word
- Complex Word Count
- Word Count
- Average Word Length
- Personal Pronouns
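A sketch of the core readability computations. The syllable counter uses the simple vowel-group heuristic with an "es"/"ed" suffix exception, which is an assumption about the exact rule the script applies; complex words are those with more than two syllables:

```python
import re

def count_syllables(word):
    """Vowel-group heuristic; words ending in 'es'/'ed' drop that suffix."""
    word = word.lower()
    if word.endswith(("es", "ed")):
        word = word[:-2]
    return max(1, len(re.findall(r"[aeiou]+", word)))

def readability(words, num_sentences):
    """Average Sentence Length, % Complex Words, Fog Index, Complex Word Count."""
    complex_count = sum(1 for w in words if count_syllables(w) > 2)
    avg_sentence_len = len(words) / max(1, num_sentences)
    pct_complex = complex_count / max(1, len(words))
    fog_index = 0.4 * (avg_sentence_len + pct_complex)
    return avg_sentence_len, pct_complex, fog_index, complex_count
```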
The results are saved in:
Output.xlsx
The output matches the format specified in Output Data Structure.xlsx.
Any missing or unreadable HTML files are logged in:
missing_files.txt
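A hypothetical sketch of this output step with pandas; the row dictionaries would use the column names from Output Data Structure.xlsx, abbreviated here:

```python
import pandas as pd

def save_results(rows, missing, out_path="Output.xlsx", log_path="missing_files.txt"):
    """Write one row per article to Excel and log missing/unreadable files."""
    pd.DataFrame(rows).to_excel(out_path, index=False)  # requires openpyxl
    with open(log_path, "w") as fh:
        fh.write("\n".join(missing))
```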
Blackcoffer_Assignment
│
├── Input.xlsx
├── Output Data Structure.xlsx
├── Output.xlsx
│
├── Saved_HTML
│ ├── 123.html
│ ├── 124.html
│ └── ...
│
├── StopWords
│ ├── StopWords_Auditor.txt
│ ├── StopWords_Currencies.txt
│ └── ...
│
├── MasterDictionary
│ ├── positive-words.txt
│ └── negative-words.txt
│
├── missing_files.txt
├── analysis.py
└── README.md
- Python
- Pandas
- BeautifulSoup
- NLTK
- Regular Expressions
Install dependencies using:
pip install pandas beautifulsoup4 nltk openpyxl

NLTK datasets are downloaded automatically by the script:
punkt
stopwords
- Clone the repository
git clone https://github.com/your-username/blackcoffer-nlp-analysis.git
- Navigate to the project folder
cd blackcoffer-nlp-analysis
- Run the script
python analysis.py
- Output will be generated as:
Output.xlsx
- The script processes locally saved HTML files instead of live scraping.
- This approach avoids failures caused by SSL issues or website access restrictions.