Web Scraping Economic Data from Forex Factory


Today I will be making some soup. A beautiful soup. Beautiful Soup is a Python library for extracting data from web pages. This lesson was particularly gruelling and challenging for me. I spent several nights troubleshooting one issue after another, and it took me about one to two weeks to learn the very basics of Beautiful Soup in Python.

When I first started learning Beautiful Soup, I asked myself what projects would be useful in the area of finance. Scraping stock prices and volume data is certainly not worth the time, since that can easily be done with the yfinance library, Alpha Vantage or pandas-datareader.

Similarly, scraping data from a weather forecast website wouldn't add much value either, since I can just visit the page directly and view it with ease.

So it has to be something that is useful, automated, time-saving and ideally insightful. After much thought, I decided to scrape economic events from Forex Factory.

Staying on Top of Economic Trends

Forex Factory has a calendar of upcoming and past economic events with actual, forecast and previous data points. Some of the more common ones include the consumer price index, manufacturing indices, the Fed's rate decisions, monetary policy meetings and non-farm payrolls. These indicators tell us what is going on in the global economy. Here is an example of how the calendar looks.

[Image: example of the Forex Factory economic calendar]

Initially, I thought of scraping the Fed's repo operations and correlating them with the bubbly S&P 500 market. But I gave up on that idea almost immediately upon visiting the page. They have done an excellent job of putting up a veil of unnecessary financial jargon. Or maybe it's just me.

Anyway, what I wanted to do was filter out all the high-impact events along with their corresponding source links, actual, forecast and previous values, as well as the full history of data points for each event, then export everything to Excel at every month-end for my personal consumption. That's the objective of this project.

In this article, I will attempt to explain how Beautiful Soup works and how I scrape economic data from Forex Factory, as simply as possible.

Overview of Beautiful Soup & Selenium

This is a slightly more advanced topic, as you first need a basic knowledge of Python and HTML. I don't have much experience in Python myself, just one month, but that was enough to get started learning other libraries.

At a high level, BeautifulSoup parses the entire HTML source code of a web page and searches for the specific tags you ask for. It can return all the values you see on the website, but the downside is that it's not interactive.

For example, it can't click a button, type in a username or password, or navigate to the next page. Your access to information is limited to what is already on the page.

This is where Selenium comes in. Selenium drives a real browser, so it can handle these interactive, JavaScript-driven actions. You can tell Selenium which buttons to click, what to type in the search box, and so on.

It is quite fascinating to watch for the first time as the browser runs automatically, at high speed, doing all sorts of things you instructed it to do. It is as though someone has taken control of your computer.

The advantage of Selenium is that it lets BeautifulSoup reach a broader scope of data. The downside is that it slows things down. Ideally, requests should be preferred over Selenium, and the latter used only as a last resort.

Part 1 – Making the Soup in BeautifulSoup

import pandas as pd
import time
import requests

from bs4 import BeautifulSoup
from selenium import webdriver

# Request last month's calendar page and hand the HTML to BeautifulSoup
url = 'https://www.forexfactory.com/calendar.php?month=last'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')

To start off, these are the libraries that I used. pandas is used to create a data frame and store all the collected data, and time is used to pause the script while pages load.

requests is used to send an HTTP request to the server for a particular web page. The server then sends a response back to the user. That is, broadly speaking, how websites work on the internet.

In this case, I used Forex Factory's calendar link for last month as my URL variable. Then I tell requests to fetch this link and send me a response back.

After we receive the response from the server, we convert it into text format and feed the data to BeautifulSoup, which then parses the document using the 'lxml' parser. This process is called "making the soup". I did not make this up; it comes directly from the documentation.

Part 2 – Searching the Tree in BeautifulSoup

Now that we have a soup, we can find anything and everything about the website.

The first thing to do is to narrow down the search. All the calendar events are contained inside a table, but there are many tables on the website, so we have to tell soup which table to focus on.

[Image: inspecting the Forex Factory calendar table in the browser]

You can use the browser's inspect tool to hover over the elements on the website and locate the areas you want to focus on. After searching around, I found that the table which contains ALL the calendar events has a class name of "calendar__table". So we are going to tell soup to search ONLY inside this table and nowhere else.

table = soup.find('table', class_='calendar__table')

In the above code, we are telling soup to find the first table tag with the class name "calendar__table" and store it in a variable called table. Now table contains the entire set of calendar events that we are interested in. That's the second step.

Part 3 – Searching within the Table

The third step is to understand the structure of an HTML table. I found a good image that sums up what goes on inside a table tag.

A table usually contains table head (thead), table row (tr) and table data (td) tags. Similarly, the table that we just crawled and stored also has table row and table data tags.
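
To make this concrete, here is a toy example. The HTML snippet below is made up purely for illustration (it is not the real Forex Factory markup); it just shows how BeautifulSoup walks through tr and td tags.

from bs4 import BeautifulSoup

# A simplified, made-up table purely for illustration
html = """
<table class="calendar__table">
  <tr data-eventid="111377">
    <td class="calendar__currency">USD</td>
    <td class="calendar__event">Non-Farm Payrolls</td>
  </tr>
</table>
"""

toy_soup = BeautifulSoup(html, 'lxml')
for row in toy_soup.find_all('tr'):      # each table row
    for cell in row.find_all('td'):      # each table data cell
        print(cell.text.strip())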

The next task is to decide what information to be retrieved. Then determine the specific location of where this information resides in the table tags. There are a total of 6 items we are interested in.

  1. Currency (Country)
  2. Event Name
  3. Actual figure
  4. Forecasted figure
  5. Previous figure
  6. Source link

Using the same method we used to find the calendar table, we hover over the elements to find their corresponding tag names and class names. For example, I have done this for the Caixin Manufacturing PMI.

You can see that all of them sit inside table data tags, each with its own UNIQUE class name. So we tell soup to search for all these tags under the table. It will search only within the table, since we narrowed it down earlier.

The next consideration is to set up a filter for high-impact events.

For example, as soup searches through each event in the table, it should ONLY extract information about the high-impact events; the rest should be ignored. Hence, a conditional check must be included.
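
A minimal sketch of that condition, using the same impact-icon class name that appears in the full code later on:

# Keep only rows whose impact icon has the 'high' class
for row in table.find_all('tr', {'data-eventid': True}):
    if row.find('span', class_='high'):
        print(row['data-eventid'], 'is a high-impact event')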

Part 4 – Extracting the URL links

Lastly, we need the URL links that will let Selenium extract the latest news release. These source links don't exist on our current webpage; you have to click the "open detail" folder icon to view more information.

[Image: event detail view showing the latest release source link]

For example, let's say I want to get the ISM Manufacturing PMI link. I first have to click the folder icon on the right, which brings up a new page as shown in the URL. Only then does it show the source link that we want. Otherwise, you can't access this latest-release link from the original webpage.

Now we have to find a way for this to happen in each loop as soup searches through the rows. One way would be to use Selenium and click on the icon, but I figured out another, faster method using requests.

If you look at the URL of each event detail, the base URL is the same; only a detail id is added at the end of the link. For example:

https://www.forexfactory.com/calendar.php?month=last#detail=111377
https://www.forexfactory.com/calendar.php?month=last#detail=111365
https://www.forexfactory.com/calendar.php?month=last#detail=115868

The only thing that differentiates each event's link is the detail id number at the end. Hence, we can build an event's link simply by extracting the id number and concatenating it with the main URL. This gives us access to all the additional information we require. Here is the code to do it.

links.append(url + "#detail=" + row['data-eventid'])

Note that the links we store here are not the actual source links themselves. Each one only opens up an additional information box with the event's details. It is from there that we extract the original source link.

Part 5 – Putting it all Together

Here is a summary of the code that I have written based on the logic that we have defined in all the above steps.

list_of_rows = []
links = []

#Filtering for event rows that carry a detail id (data-eventid)
for row in table.find_all('tr', {'data-eventid':True}):
    list_of_cells = []
    
    #Filtering high-impact events
    for span in row.find_all('span', class_='high'):
        links.append(url + "#detail=" + row['data-eventid'])
        
        #Extracting the values from the table data in each table row
        for cell in row.find_all('td', class_=[
          'calendar__cell calendar__currency currency', 
          'calendar__cell calendar__event event', 
          'calendar__cell calendar__actual actual', 
          'calendar__cell calendar__forecast forecast', 
          'calendar__cell calendar__previous previous']):
            
            list_of_cells.append(cell.text)
        list_of_rows.append(list_of_cells)

First, we create some empty lists. Then we ask Python to loop through each row in the table. There are two conditions in the loop: the row must have an event detail id, and it must be a high-impact event. In each iteration, we store the detail link built from the data-eventid into a separate list. This list will be used later to collect the source links.

Finally, it also extracts all the table data values: currency, event name, actual, forecast and previous. We append them to list_of_cells, and for each row we append list_of_cells to list_of_rows. So list_of_rows is a list of lists. That is a brief overview of what is happening inside the code.

Let’s check out how the first 10 list items look.
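
Something as simple as the line below would show them (the original output screenshot isn't reproduced here):

# Inspect the first 10 scraped rows
print(list_of_rows[:10])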

Alright, looks pretty good. Just some data cleaning to do. The first column has some '\n' characters that need to be removed. To do that, let's put all this list data inside a nice pandas data frame.

Part 6 – Converting it into a Data Frame

df = pd.DataFrame(list_of_rows,
                  columns=['Country', 'Event Title', 'Actual', 'Forecast', 'Previous'])

df.iloc[:, 0] = df.iloc[:, 0].str.split('\n').str[1]
df = df.set_index(df.columns[0])
df = df.sort_values('Country')

The first line of code converts list_of_rows into a data frame and sets the column names to "Country", "Event Title", "Actual", "Forecast" and "Previous".

The second line is where the cleaning is done: it takes the string values in the first column and splits them on the delimiter '\n'. This produces three pieces, but we are interested only in the middle one, which is the currency name. Hence str[1].
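
To see why str[1] picks out the currency, here is a tiny worked example; the '\nUSD\n' strings are my guess at what the scraped cell text looks like, based on the description above:

import pandas as pd

# Made-up sample values mimicking the scraped first column: '\n' + currency + '\n'
sample = pd.Series(['\nUSD\n', '\nCNY\n'])
print(sample.str.split('\n').str[1].tolist())  # ['USD', 'CNY']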

The third line of code is just setting the “Country” column to be the index column.

And lastly, I sorted the rows by country name.

Let’s see how the data frame looks now.

BAM! Looks pretty neat now. We have successfully extracted all the high-impact events with their corresponding actual, forecast and previous values. There are a total of 65 high-impact events in January. Now there is just one thing missing: the source link.

Part 7 – Calling in Selenium

[Image: event detail panel showing the latest release link (green box) and the historical data with the "more" button (red boxes)]

In this series, I haven't really shown off the power of Selenium yet. But I chose to use Selenium because I intend to grab the entire history of actual values for each event.

You can see the two red boxes on the right. What Selenium can do is click the "more" button repeatedly until the history goes all the way back to the earliest date. Once the entire table of historical data points is displayed, I would ask Python to grab all those dates and actual values.

However, this enhanced feature is still a work in progress. For now, I am just releasing the first basic version, which is the green box on the left: it only grabs the latest-release source link. If you don't see part 2 after a few weeks, it means my project has failed.

url_links = []
driver = webdriver.Chrome('chromedriver.exe')

for link in links:
    driver.get(link)
    driver.refresh()
    time.sleep(2)  # give the detail box time to load before searching for the link
    link_text = driver.find_element_by_link_text('latest release').get_attribute('href')
    url_links.append(link_text)

Remember how, in the earlier part, I extracted the event id in each loop iteration and appended it to the base URL? All those links are stored in the links list.

The code above loops through every link in the list to get the source link. I refresh the page and sleep for 2 seconds, because sometimes it throws an error when Python rushes off to search for the element before the page has even loaded. It took a little over 3 minutes for Python to return all the source links. Let's check out the first 10 of them.

Looks good so far. Now let’s put all these links into our original data frame.
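
The post doesn't show this step explicitly. One straightforward way, assuming the links are attached while the rows are still in scraping order (i.e. before the sort by country) and using a column name of my own choosing, would be:

# Hypothetical step: attach the scraped links as a new column.
# This positional assignment only lines up if the row order still matches
# the order in which the links were scraped.
df['Source Link'] = url_links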

[Image: data frame with the source links added]

Part 8 – Exporting it to Excel

Done! Everything looks perfect. The final step is to export this data frame to an Excel file.

df.to_excel("Monthly Economic Data.xlsx", sheet_name='Economic Data')

TADA! Here is the final output of v1.0. Hopefully, v2.0 materialises in a few days.

[Image: final exported data frame of high-impact events with source links]

This is the entire list of high-impact economic events that happened in January 2020, from unemployment rates to monetary policy statements, retail sales to crude oil inventories. Python and BeautifulSoup have scraped everything you need to know about the global economy. Since I have already written the code, I can easily run it again for February, March and so on.

Hopefully, this article provides a brief overview of what BeautifulSoup does and its potential use cases.

Lastly, if you are interested in learning the recipe for making a beautiful soup, do check out the Python resources page where I consolidate all my learning materials.

