Generating Word Cloud for TheBabylonians in Python

Wouldn’t it be cool to see which words appear most frequently on your website? What do you write about most of the time? That is what a word cloud shows, and that is the objective of this article.

I came across the Python word cloud as I was going through Google’s Python Crash Course on Coursera. I am back to learning Python again because the markets are overvalued. There is nothing to buy and hence nothing to write about, so I might as well invest in myself and do some other cool and interesting stuff.

1. Intuitive Explanation of Python Project

The goal of this project is to write a Python script that counts the frequency of every single word on my blog. I have published a total of 76 blog posts, dispersed across 6 different categories. You can see them in the top menu header of my website: Stocks, REITs, Crypto, Technical Analysis, Python and Personal Thoughts.

I need the URLs of all 76 blog posts so that I can convert them into text for Python to count. That is the first problem to think about: how do I loop through each category and extract the URL of each individual post on my website?

Once we have the links to all the blog posts, the next step is to clean up the data a bit and convert it into a text string. Words inside script tags or style tags are irrelevant to us. Empty spaces, new lines, blanks and punctuation are meaningless. Then you also have stop words like “the”, “and”, “or”, “what” and “they”, which don’t add any meaning.

We don’t want Python to count those, so such words should be excluded. The code needs some sort of filter logic to do this. That is the second problem: data cleaning and filtering.

And finally, once the data is cleaned up, we have to think of a way to count and store the words. For this, I use a Python dictionary to tabulate the results. If the word is already in the results, increment its count by 1; otherwise, set it to 1.

To sum up, the Python code should do something like this: for every blog article, scan through every single word and count it. Record the frequency and store it in the result dictionary. Repeat this for all 76 articles. The final output is the word cloud you will see in section 3.
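
To make the counting idea concrete before we get to the real code, here is the logic in miniature (the words here are made up purely for illustration):

result = {}
for word in ["gold", "crypto", "gold"]:
    if word not in result:
        result[word] = 1
    else:
        result[word] += 1

print(result)   # {'gold': 2, 'crypto': 1}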

2. Python Explanation of Word Cloud Project

Here are the libraries and modules that I have imported.

import urllib.parse
import requests
import urllib.request as ur
import wordcloud
import matplotlib
import numpy as np
from matplotlib import pyplot as plt
from IPython.display import display
from bs4 import BeautifulSoup
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
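
One note if you are following along: everything above except urllib is a third-party package, so they need to be installed first (assuming you use pip; the names below are the PyPI package names):

# Install the third-party dependencies first, for example:
# pip install requests beautifulsoup4 lxml wordcloud matplotlib numpy ipython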

I never thought I would have to use BeautifulSoup again for web scraping. Luckily I learnt it earlier this year, so it doesn’t sound so foreign to me now.

2.1 Collate URL Links of ALL my Blog Articles

The 1st problem, as mentioned earlier, is to collect all the URL links and store them in a list. I used a for-loop to iterate through each category, and BeautifulSoup to search for links with an href tag. Here is how the code looks.

page_urls = [
    'https://www.theancientbabylonians.com/category/guide-to-stocks-analysis/',
    'https://www.theancientbabylonians.com/category/guide-to-reits-analysis/',
    'https://www.theancientbabylonians.com/category/investing-in-cryptocurrencies/',
    'https://www.theancientbabylonians.com/category/technical-analysis/',
    'https://www.theancientbabylonians.com/category/python/',
    'https://www.theancientbabylonians.com/category/personal-thoughts/'
]

First, I store each category base link in a variable called page_urls.

Then I loop through each of the categories and do the following:

blog_titles = []
for url in page_urls:
    response = requests.get(url)
    data = response.text
    soup = BeautifulSoup(data, 'lxml')

    # Each post title on a category page sits inside this class
    for title in soup.find_all(class_='entry-title mh-loop-title'):
        link = title.find('a', href=True)
        if link is None:
            continue
        blog_titles.append(link['href'])

For each category, BeautifulSoup finds ALL the content inside the class ‘entry-title mh-loop-title’. Within each match, it then finds the ‘a’ tag with href=True; this is where the URL of each blog article resides in the source code. Finally, the URL is appended to the list, which I named “blog_titles”.

Here is how the list looks after looping through each category.
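
A quick sanity check prints the size of the list and a peek at the first few entries (the exact URLs will of course be your own):

# Quick sanity check on the collected links
print(len(blog_titles))   # should be 76 for my site
print(blog_titles[:3])    # peek at the first few URLs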

2.2 Data Cleansing & Word Count

We have solved the 1st problem! Now that we have collated ALL the blog links, the next step is to loop through the list again and count the words in each blog post.

result = {}

for blog in blog_titles:
    req = ur.Request(blog, headers={"User-Agent": "Chrome"})
    html = ur.urlopen(req).read()
    soup = BeautifulSoup(html, 'lxml')
    
    #STEP 1: DATA CLEANING
    
    # kill all script and style elements
    text = ''   # make sure `text` exists even if the content class is missing
    for main_content in soup.find_all(class_='mh-content'):
        for script in main_content(["script", "style"]):
            script.extract()    # rip it out

        # get text
        text = main_content.get_text()

        # break into lines and remove leading and trailing space on each
        lines = (line.strip() for line in text.splitlines())
        
        # break multi-headlines into a line each
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        
        # drop blank lines
        text = '\n'.join(chunk for chunk in chunks if chunk)
    
    #STEP 2: DATA FILTERING
    
    #Listing out all the non-value stop words (punctuation is dropped later by isalpha())
    stopwords = set(STOPWORDS)
    stopwords.update(["the", "a", "to", "if", "is", "it", "of", "and", "or", "an", "as", "i", "me", "my", 'long', 'always', \
    "we", "our", "ours", "you", "your", "yours", "he", "she", "him", "his", "her", "hers", "its", "they", "them", \
    "their", "what", "which", "who", "whom", "this", "that", "am", "are", "was", "were", "be", "been", "being", 'reit', \
    "have", "has", "had", "do", "does", "did", "but", "at", "by", "with", "from", "here", "when", "where", "how", 'low', \
    "all", "any", "both", "each", "few", "more", "some", "such", "no", "nor", "too", "very", "can", "will", "just", \
    "about", "for", "on", "would", "like", "not", "in", "think", "because", "why", "using", "new", "review", "building",\
    "should", "reasons", "changyuesin", "want", "jul", "now", "still", "first", "means", "us", "see", "look", "much", "net",\
    "one", "really", "next", "use", "price", "go", "every", 'way', 'back', 'going', 'cash', 'time', 'money', 'total',\
    "know", "even", "span", 'higher', 'average', 'around', 'buy', 'used', 'might', 'take', 'market', 'good', 'reason',\
    'company', 'companies', 'business', 'make', 'high', 'email', 'different', 'pm', 'lower', 'past', 'right', 'bank', 'made', 'pay',\
    'simply', 'many', 'started', 'cloud', 'need', 'last', 'years', 'year', 'start', 'increase', 'flow', 'big', 'current', 'already',\
    'interest', 'people', 'prices', 'whether', 'million', 'based', 'read', 'sell', 'buying', 'better', 'things', 'thing', 'another',\
    'return', 'returns', 'probably', 'line', 'strong', 'previous'])
          
    words = text.split()   # split on any whitespace, not just single spaces
    for word in words:
        y = word.lower()
        if y not in stopwords and y.isalpha():
            if y not in result:
                result[y] = 1
            else:
                result[y]+=1

First, create an empty dictionary called “result”.

Second, pass in the URL of each blog post and send a request to the server. It returns a response object, and we call the read method to read the source data.

Third, parse it with BeautifulSoup and clean up all the unwanted data, as seen under the comment #STEP 1: DATA CLEANING. Then convert the entire blog article into a text string.

Fourth, list the words that we don’t want Python to count. The list is pretty long, and I had to manually add words that I found meaningless after inspecting the output. So this is an arbitrary and iterative process, and probably not the best way to write a piece of code.

Finally, check 3 conditions. Is the word in our set of stop words? Is the word purely alphabetic? And lastly, does the word already exist in our result dictionary? If yes, add 1 to its count. If no, that word starts at a count of 1. That’s it!
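
As an aside, collections.Counter from the standard library would tidy up that counting block. Here is an equivalent self-contained sketch, not the code I actually used above:

from collections import Counter

stopwords = {"the", "and", "or"}            # trimmed-down stand-in for the real set
text = "Gold and crypto, then gold again"   # stand-in for one cleaned blog post

result = Counter()
words = (word.lower() for word in text.split())
result.update(y for y in words if y not in stopwords and y.isalpha())

print(result)   # Counter({'gold': 2, 'then': 1, 'again': 1})

Since Counter is a dict subclass, the word cloud step later accepts it directly.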

2.3 Summary of Python Codes

In summary, the Python code loops through all the categories and appends the URL of each blog article to a list variable called “blog_titles”. It then loops over each URL in “blog_titles” to clean, filter and count all the words. Finally, we check the stop-word conditions and tabulate the frequency count of each filtered word.

It takes about 45 seconds for Python to search and count every single word in all 76 published articles. That is the power of computers. Imagine asking humans to do this. *Kind of reminds me of my audit intern life

3. Final Output – TheBabylonians Word Cloud

# Display your word cloud image
cloud = WordCloud(max_words=100, width=800, height=600, min_font_size=14, colormap='Reds')
cloud.generate_from_frequencies(result)
plt.figure(figsize=(20, 20))
plt.imshow(cloud, interpolation='nearest')
plt.axis('off')
plt.show()

Now that we have the result dictionary of frequency counts, the last and final step is to generate the word cloud with the function we imported. Here is the end result of the Word Cloud on TheBabylonians: the top 100 words that appear the most on my website.
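
If you want to keep a copy of the image instead of only displaying it, the WordCloud object also has a to_file method (the filename here is just an example):

# Save the rendered cloud to disk as a PNG (example filename)
cloud.to_file('thebabylonians_wordcloud.png')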

The first few big words that pop up are “gold”, “crypto”, “bitcoin”, “growth” and “value”. The rest seem to be words used for analysing financial statements: common terms like “financial”, “ratio”, “cost”, “revenue”, “earnings”, “operating”, “income”, “balance sheet”, “capital” and “debt”. Other interesting terms that appear are “blockchain”, “python”, “Singapore” and “data”. “Banks” and “REITs”, along with “rates” and “dividends”, seem quite common too, as those are the only 2 sectors I analyse in Singapore. There are also “Senkou” and “Kumo”, which come from my Ichimoku guide in the technical analysis category.

This is a pretty interesting Python project to explore, and I would say the results are quite accurate. I do indeed write a lot about gold, bitcoin and crypto in general.

However, it is not perfect, as you can still see some stop words that don’t add any meaning to a sentence. The scope of the search might not be well-defined, as I might have missed some tags or class names in BeautifulSoup. My code is clearly not Pythonic either, but it is good enough for me as long as it works.

Try it out on your own and see what words make up the essence of your website or community forum!

For TheBabylonians, it is gold and crypto.
