Scraping WikiLeaks for Hillary Clinton Emails Related to Bangladesh

I wanted to get my feet wet in the world of web scraping, so I decided to scrape WikiLeaks. I am from Bangladesh, so I thought: let's see what WikiLeaks has in store for Bangladesh. Hillary Clinton looks likely to be the next US president, so I decided to collect the Hillary Clinton emails related to Bangladesh that WikiLeaks has made public. With the scraped data, I hope to do some document modeling some other day, inshaAllah. So, let's get to work!

Importing necessary libraries

We will import some libraries here to make our life easier along the way.

import requests, bs4, re, json
import pandas as pd

After looking at a few emails on the WikiLeaks website, I decided to store the data in the following format.

data = {
    'title': [],
    'from': [],
    'to': [],
    'date': [],
    'subject': [],
    'body': []
}

Scraping WikiLeaks

There are 8 pages of search results for Hillary Clinton emails related to Bangladesh. So, we open the emails from each results page, scrape them, store them in our dictionary, and then move on to the next page. The code snippet below has comments for all the parts, so it is fairly self-explanatory. I will leave it at that.

page = 1
base_url = "https://search.wikileaks.org/"
url = 'https://search.wikileaks.org/?query=bangladesh&exact_phrase=&any_of=&exclude_words=&document_date_start=&document_date_end=&released_date_start=&released_date_end=&publication_type%5B%5D=42&new_search=False&order_by=newest_released_date#results'
# There are 8 result pages, so we use this loop to go through each page and scrape.
while (page < 9):
    #set the url and go to it
    print "going to wikileaks for searching, page:", page
    res = requests.get(url)
    res.raise_for_status()


    # get the search result page
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    linkElems = soup.select('.info a')

    #open each result and scrape its data
    for linkElem in linkElems:

        # fetch the search result's page
        result_url = linkElem.get('href')
        result_html = requests.get(result_url)
        result_html.raise_for_status()
        result_soup = bs4.BeautifulSoup(result_html.text, 'html.parser')

        #extract the data
        #title of the document
        title = result_soup.select('.tab-content h2')[0].get_text()
        #body of the mail
        content = result_soup.select('.email-content')[0].get_text()
        content = content.encode('utf-8')
        #Strip unnecessary white spaces
        content = re.sub(r'\s+ ', ' ', content)

        #get the from, to, date and subject data from the header
        header = result_soup.select('#header')[0].get_text()
        # break the header down line by line to fit our dictionary format

        #get the sender; split only on the first ':' so values containing colons are not cut off
        sender = header.splitlines()[1]
        sender = sender.split(':', 1)[1]
        sender = sender.strip().encode('utf-8')

        #get the receiver
        receiver = header.splitlines()[2]
        receiver = receiver.split(':', 1)[1]
        receiver = receiver.strip().encode('utf-8')

        # get the date and time; dropping the ':' between hours and minutes gives e.g. '2010-09-01 0200'
        dt = header.splitlines()[3]
        dt = dt.split(':')
        dt = (dt[1] + dt[2]).strip().encode('utf-8')

        # get the subject
        subject = header.splitlines()[4]
        subject = subject.split(':', 1)[1]
        subject = subject.strip().encode('utf-8')

        #add all the data to our dictionary
        data['title'].append(title)
        data['from'].append(sender)
        data['to'].append(receiver)
        data['date'].append(dt)
        data['subject'].append(subject)
        data['body'].append(content)


    #go to next page
    page += 1
    if page < 9:
        #get the next page's link
        next_page = soup.select('.next a')
        next_page_url = next_page[0].get('href')
        #set url
        url = base_url + next_page_url

#get the data into a pandas dataframe
email_leaks = pd.DataFrame(data)

going to wikileaks for searching, page: 1
going to wikileaks for searching, page: 2
going to wikileaks for searching, page: 3
going to wikileaks for searching, page: 4
going to wikileaks for searching, page: 5
going to wikileaks for searching, page: 6
going to wikileaks for searching, page: 7
going to wikileaks for searching, page: 8

All done! Let's take a look at our collected data.

email_leaks.shape
(364, 6)
email_leaks
     body                                                date             from             subject                                            title                                               to
0    \nUNCLASSIFIED U.S. Department of State Case N...  2010-09-01 0200  Akbar Zaidi      PAKISTAN'S ROLLER-COASTER ECONOMY                  PAKISTAN'S ROLLER-COASTER ECONOMY: TAX EVASION...
1    \nUNCLASSIFIED U.S. Department of State Case N...  2010-10-24 0445  Hillary Clinton  TRIP READING - - BANGLADESH IS FOLLOWING THE ...   TRIP READING - - BANGLADESH IS FOLLOWING THE L...   Lauren Jiloty
2    \nUNCLASSIFIED U.S. Department of State Case N...  2001-01-01 0300  Daily Sun        EVOLVING DIPLOMATIC ECO-SYSTEM AND BANGLADESH...   EVOLVING DIPLOMATIC ECO-SYSTEM AND BANGLADESH ...
3    \nUNCLASSIFIED U.S. Department of State Case N...  2010-12-05 2139  Hillary Clinton                                                                                                         Lauren Jiloty
4    \nUNCLASSIFIED U.S. Department of State Case N...  2010-12-06 0007  Hillary Clinton  MORE                                               MORE                                                Lauren Jiloty
5    \nUNCLASSIFIED U.S. Department of State Case N...  2010-12-06 0010  Hillary Clinton  MORE                                               MORE                                                Melanne Verveer
6    \nUNCLASSIFIED U.S. Department of State Case N...  2010-12-05 2140  Hillary Clinton                                                                                                         Michael Fuchs
7    \nUNCLASSIFIED U.S. Department of State Case N...  2010-12-08 2253  Hillary Clinton  LATEST                                             LATEST                                              Melanne Verveer
...  ...                                                 ...              ...              ...                                                ...                                                 ...
362  \nUNCLASSIFIED U.S. Department of State Case N...  2012-06-07 0653  -                                                                                                                       -
363  \nUNCLASSIFIED U.S. Department of State Case N...  2010-04-27 0902  -                                                                                                                       -

364 rows × 6 columns
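
A quick note on the date column: it is stored as plain text in the 'YYYY-MM-DD HHMM' form produced by the header parsing above. For the document modeling I have in mind later, proper timestamps will be handier. Below is a minimal sketch of how pandas can parse them; it assumes every row follows that exact pattern (anything that does not match becomes NaT), and the 'timestamp' column name is just my own choice.

#parse the scraped date strings into real timestamps
#assumes every value looks like '2010-09-01 0200'; rows that don't match become NaT
email_leaks['timestamp'] = pd.to_datetime(email_leaks['date'].str.strip(),
                                          format='%Y-%m-%d %H%M',
                                          errors='coerce')
#now we can, for example, sort the emails chronologically
email_leaks = email_leaks.sort_values('timestamp')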

Saving the data in different file formats

We can save this dataframe in various formats, thanks to pandas. This particular dataset has some encoding-related problems with the CSV and Excel formats, so we save it as JSON and as a plain text file. The LaTeX and HTML exports below are included just for demonstration; you can try those out too.

#Saving the data in different formats
#Latex
email_leaks.to_latex('leaks.tex')
#HTML
email_leaks.to_html('lix.html')
#JSON
email_leaks.to_json('mails.json')
#Text File
json.dump(data, open("mails.txt",'w'))
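
If you still want a CSV copy despite the encoding issues mentioned above, passing an explicit encoding to pandas usually gets around them, and reading the JSON dump back in is a quick sanity check that nothing was lost. A small sketch using the file names from above (the 'mails.csv' name is my own addition):

#CSV with an explicit UTF-8 encoding usually avoids the unicode errors
email_leaks.to_csv('mails.csv', encoding='utf-8', index=False)
#read the JSON dump back to confirm the round trip worked
check = pd.read_json('mails.json')
print(check.shape)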

So, scraping was pretty easy, thanks to the wonderful libraries available. Let's call it a day and grab some coffee. Adios!