Scraping news with Python

June 2020

Illustration by Barbara

In this article, I will explore how to scrape a newspaper website and extract its articles using Python and the command line. In general, I would recommend Python for anything that involves the CLI, thanks to its excellent tooling.

Let's have a look at how this can be achieved using click (to install it, run pip install click), a Python library that streamlines building command-line interfaces:

import click
from importlib import import_module

@click.command()
@click.option("--scraper")
def main(scraper):
    pass

if __name__ == "__main__":
    main()

Since every website is built differently in terms of HTML structure, we want to separate our scrapers into dedicated classes, each with its own extraction logic. A simple strategy pattern lets us pick the correct scraper based on the command-line argument:

def get_scraper(scraper):
    try:
        name = scraper + "Scraper"  # e.g. "NewsPaper" -> module and class "NewsPaperScraper"
        module = import_module(name)
        return getattr(module, name)()
    except (ModuleNotFoundError, AttributeError) as error:
        print(error)
        return None
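
With this convention, passing --scraper=NewsPaper means the scraper is expected to live in a module named NewsPaperScraper.py that defines a NewsPaperScraper class (the class itself is shown further down). A quick usage sketch:

# assumes a NewsPaperScraper.py module defining a NewsPaperScraper class (see below)
registered_scraper = get_scraper("NewsPaper")  # a NewsPaperScraper instance, or None on failure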

Now that we have our scraper class and an instance of it, we need a way to reach the news website and read its content. A simple GET request in Python can be carried out using requests (to install it, run pip install requests):

import requests
page = requests.get("http://www.newspaper.com")
page.text # your news in html
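
In a real scraper you will also want to guard against slow or failing responses. This is not part of the snippet above, but a minimal sketch using requests' timeout parameter and raise_for_status could look like this:

try:
    page = requests.get("http://www.newspaper.com", timeout=10)
    page.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    html = page.text
except requests.RequestException as error:
    print(error)
    html = ""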

To extract articles from a piece of HTML we need to traverse the HTML tree. Thankfully, there's a nice library called BeautifulSoup that does just that (among other things). Let's install it with pip install beautifulsoup4. Before handing our scraped HTML to BeautifulSoup, we also need to make sure it's encoded correctly:

from bs4 import BeautifulSoup

def get_parser(registered_scraper, html):
    try:
        # encode the raw HTML with the encoding declared by the scraper class
        html = bytes(html, registered_scraper.encoding)
    except Exception as error:
        print(error)
        html = ""

    return BeautifulSoup(html, "html.parser")

Finally, let's implement our news scraper:

class NewsPaperScraper(object):
    location = "https://www.newspaper.com"
    encoding = "utf-8"

    def extract_articles(self, parser):
        # simplified, you might need to lookup specific DOM elements
        return list(map(self.extract_article, parser.select(".news-article")))

    def extract_article(self, article_block):
        return article_block.getText()
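
Each site you want to support then gets its own class of the same shape, living in its own module so the import logic above can find it. Purely as an illustration, a hypothetical BBCNewsScraper might look like the following; the URL and CSS selector here are placeholders, not the site's real markup:

# hypothetical BBCNewsScraper.py -- location and selector are illustrative placeholders
class BBCNewsScraper(object):
    location = "https://www.bbc.co.uk/news"
    encoding = "utf-8"

    def extract_articles(self, parser):
        # adjust the selector to whatever the target site actually uses for article blocks
        return list(map(self.extract_article, parser.select(".article")))

    def extract_article(self, article_block):
        return article_block.getText()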

Wrapping it all up
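
To tie everything together, the main command only needs to pick the scraper, fetch its page, build the parser, and print the extracted articles. Here is a minimal sketch of that wiring (the required and help arguments on --scraper are small additions of mine):

@click.command()
@click.option("--scraper", required=True, help="Name of the scraper, e.g. NewsPaper")
def main(scraper):
    registered_scraper = get_scraper(scraper)
    if registered_scraper is None:
        return

    page = requests.get(registered_scraper.location)
    parser = get_parser(registered_scraper, page.text)

    for article in registered_scraper.extract_articles(parser):
        click.echo(article)

if __name__ == "__main__":
    main()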

You can have a look at a working example for BBC News on my GitHub repo. Just clone the repo and install the dependencies from requirements.txt:

pip install -r requirements.txt
python scrape.py --scraper=BBCNews