Building a web scraper with Python and Selenium.

A web scraper built with Selenium that uses the Pandas library to store the data in a CSV file.

I recently became curious about web scrapers and started building one.

In this tutorial, we will use Selenium to build our web scraper to search wheat prices and store them in a CSV file using the Pandas library.

First, let's install our dependencies:

pip install selenium pandas
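Selenium's API has changed between releases, so it can help to confirm which versions were actually installed. Here is a minimal sketch using the standard importlib.metadata module; installed_version is a helper name made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the two dependencies used in this tutorial.
for name in ("selenium", "pandas"):
    print(name, installed_version(name) or "not installed")
```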

This is the website we will scrape:

https://www.macrotrends.net/2534/wheat-prices-historical-chart-data

We will use our scraper to extract the historical data from this table:

historical_data.png

Now we have to identify the HTML tags that contain the data we want to extract.

image.png

In the image above, we select the page element we want to extract and choose "Inspect" (from the right-click menu). It shows us that the <td> tags hold the individual values, while each <tr> tag holds a whole row. For learning purposes, I think it is easier in this case to extract the whole row.
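To make the relationship between rows and cells concrete, here is a small sketch that parses a made-up table fragment with Python's standard html.parser module. The HTML and its values are invented for illustration and are not taken from the site; RowCollector is a hypothetical helper, not part of Selenium.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._current = None    # cells of the row being read, if any
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current = []
        elif tag in ("td", "th") and self._current is not None:
            self._in_cell = True
            self._current.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._current is not None:
            self.rows.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._in_cell and self._current:
            self._current[-1] += data.strip()


# Invented sample data, only to show the structure.
SAMPLE_HTML = """
<table>
  <tr><td>2021</td><td>$6.50</td></tr>
  <tr><td>2020</td><td>$5.50</td></tr>
</table>
"""

collector = RowCollector()
collector.feed(SAMPLE_HTML)
print(collector.rows)
```

Each inner list corresponds to one <tr>, and each string in it to one <td>. Taking the whole row, as this tutorial does, means reading the text of the <tr> in one go.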

main.py

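from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd

# Run Chrome without opening a browser window.
options = Options()
options.add_argument("--headless")

# Selenium 4 passes the driver path through a Service object.
service = Service(executable_path="C:/path/to/chromedriver.exe")
driver = webdriver.Chrome(service=service, options=options)

driver.get("https://www.macrotrends.net/2534/wheat-prices-historical-chart-data")

# Every <tr> element on the page is one table row.
rows = driver.find_elements(By.TAG_NAME, "tr")

# Collect the text of each row.
list_rows = []
for row in rows:
    list_rows.append(row.text)

# Build the DataFrame and write the CSV once, after the loop.
df = pd.DataFrame(list_rows)
df.to_csv('prices_wheat.csv')

driver.quit()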

We import the packages we are going to use, create an instance of the Options class, and enable headless mode so our scraper does its job without opening the Chrome GUI.

Then, we create the driver with webdriver.Chrome(), giving it the options and the path where the ChromeDriver executable is located. If you don't have ChromeDriver, you may download it here.
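How the driver path is passed depends on the Selenium version. In Selenium 4 it goes through a Service object, and from Selenium 4.6 on the bundled Selenium Manager can locate a matching driver when no path is given. The sketch below wraps both cases in one function; make_driver and its parameters are names chosen for this example, not part of Selenium.

```python
def make_driver(driver_path=None, headless=True):
    """Create a Chrome driver, optionally headless and with an explicit driver path."""
    # Imported here so the sketch can be loaded even where Selenium is absent.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    if headless:
        # Run Chrome without opening a browser window.
        options.add_argument("--headless")

    # With no path, Selenium resolves the driver on its own
    # (assumes Selenium 4.6 or later).
    service = Service(executable_path=driver_path) if driver_path else Service()
    return webdriver.Chrome(service=service, options=options)


# Usage (needs Chrome installed):
# driver = make_driver()                                  # automatic driver lookup
# driver = make_driver("C:/path/to/chromedriver.exe")     # explicit path
```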

We call the get() method and pass the URL of the page we are going to scrape. Then, we define a variable, in this case rows, and assign it the result of the find_elements() method, to which we pass the tag name of the elements we want to collect, in this case <tr>.
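If a page fills in its table with JavaScript after loading, find_elements() can run too early and return an empty list. Whether that happens on this site is not something the tutorial checks, so treat the following as an optional sketch: it uses Selenium's WebDriverWait and expected_conditions to wait until at least one <tr> is present. The name wait_for_rows and the 10-second default are assumptions made for this example.

```python
def wait_for_rows(driver, timeout=10):
    """Return all <tr> elements once at least one is present on the page."""
    # Imported here so the sketch can be loaded even where Selenium is absent.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    # until() polls the condition and raises TimeoutException
    # if no <tr> shows up within the timeout.
    return wait.until(EC.presence_of_all_elements_located((By.TAG_NAME, "tr")))


# Usage, replacing the direct find_elements() call:
# rows = wait_for_rows(driver)
```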

We define an empty list, then iterate through rows and append the text of every row to it. We convert the list into a pandas DataFrame, using the DataFrame() constructor, and save it as a CSV file. After that, we call the quit() method to close the browser.

To run this program, run this command in your command line:
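If the script raises an error before it reaches quit(), the browser process can be left running. One way to guard against that is a try/finally block, sketched here as a small context manager. The name quitting is invented for this example; it only assumes the object has a quit() method, as Selenium drivers do.

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield the driver and always call its quit() method afterwards."""
    try:
        yield driver
    finally:
        # Runs on normal exit and when an exception is raised.
        driver.quit()


# Usage with the driver from main.py:
# with quitting(webdriver.Chrome(options=options)) as driver:
#     driver.get("https://www.macrotrends.net/2534/wheat-prices-historical-chart-data")
#     rows = driver.find_elements(By.TAG_NAME, "tr")
```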

py main.py

This is the CSV file it will generate.

image.png

I will change the code a little bit so every value is separated by a comma.

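for row in rows:
    # Replace spaces with commas so the values in each row are comma-separated.
    list_rows.append(row.text.replace(" ", ","))

# Build the DataFrame and write the CSV once, after the loop.
df = pd.DataFrame(list_rows)
df.to_csv('wheat_prices.csv')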

image.png
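An alternative to replacing spaces inside one string is to split each row into a list of values, so that every value becomes its own CSV column. The sketch below does this with the standard csv module instead of pandas. The sample rows are invented and rows_to_csv is a hypothetical helper. Splitting on whitespace assumes no value contains a space, which will not hold for a header row with multi-word column names.

```python
import csv
import io


def rows_to_csv(row_texts):
    """Return CSV text with one column per whitespace-separated value."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for text in row_texts:
        cells = text.split()  # split on any run of whitespace
        if cells:             # skip rows with no text
            writer.writerow(cells)
    return buffer.getvalue()


# Invented sample values, not real prices from the site.
sample = ["2021 $6.50 $6.40", "2020 $5.50 $5.60"]
print(rows_to_csv(sample))
```

In the scraper, the input would be the row.text values collected in list_rows, and the returned text could be written to a file with open().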

Conclusion

This was the first time I used Selenium, and I don't have any experience building scrapers, but I was interested in building one so I could use the data to fill a database and use it in an API. If you have any suggestions about how to improve the scraper, please leave a comment.

Thank you for taking the time to read this article.

If you have any recommendations about other packages, architectures, how to improve my code, my English, or anything else, please leave a comment or contact me through Twitter or LinkedIn.

The source code is here.

Reference