
How To Scrape the Dark Web

[Source: This article was published in towardsdatascience.com by Mitchell Telatnik]

Scraping the Dark Web using Python, Selenium, and TOR on Mac OSX

Warning: Accessing the dark web can be dangerous! Please continue at your own risk and take necessary security precautions such as disabling scripts and using a VPN service.

Introduction

To most users, Google is the gateway to exploring the internet. However, the deep web contains pages that cannot be indexed by Google. Within this space lies the dark web: anonymized websites, often called hidden services, that deal in criminal activity ranging from drugs to hacking to human trafficking.

Website URLs on the dark web do not follow conventions and are often a random string of letters and numbers followed by the .onion top-level domain. These websites require the TOR browser to resolve and cannot be accessed through traditional browsers such as Chrome or Safari.

Finding Hidden Services

The first hurdle in scraping the dark web is finding hidden services to scrape. If you already know the locations of the websites you wish to scrape, you are in luck! The URLs of these websites are often not searchable and are passed from person to person, either in person or online. Luckily, there are a couple of methods we can use to find these hidden services.

Method 1: Directories

Directories containing links to hidden services exist on both the dark web and the surface web. These directories can give you a good starting point, but they tend to list only the better-known, more easily found services.

Method 2: Snowball Sampling

Snowball sampling is a crawling method that takes a seed website (such as one you found in a directory) and then crawls it looking for links to other websites. After collecting these links, the crawler continues the process on those sites, expanding its search exponentially. This method can find hidden services not listed in directories. In addition, such sites are more likely to attract serious criminals, since they are less open about their existence.

While the snowball sampling method is recommended for finding hidden services, a full implementation is beyond the scope of this article; a rough sketch of the idea is shown below.
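The snippet below is only an illustration of the approach, not a production crawler. It assumes a Selenium driver already configured to use the TOR browser (as set up in the Implementation section later in this article); the snowball function name, seed_urls argument, and max_sites cap are illustrative choices.

from urllib.parse import urlparse

def snowball(driver, seed_urls, max_sites=50):
    # Queue of pages still to crawl and set of hidden services found so far
    to_visit = list(seed_urls)
    found = set(seed_urls)
    while to_visit and len(found) < max_sites:
        driver.get(to_visit.pop(0))
        # Collect every outgoing .onion link on the current page
        for a in driver.find_elements_by_tag_name('a'):
            href = a.get_attribute('href')
            host = urlparse(href).hostname if href else None
            if host and host.endswith('.onion') and href not in found:
                found.add(href)
                to_visit.append(href)
    return found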

Environment Setup

After the hidden services to be scraped have been identified, the environment needs to be set up. This article covers the use of Python, Selenium, the TOR browser, and Mac OSX.

TOR Browser

The TOR browser is a browser that uses the TOR network and will allow us to resolve websites with a .onion address. The TOR browser can be downloaded from the Tor Project website.

VPN

Running a VPN while crawling the dark web can provide you with additional security. A virtual private network (VPN) is not required, but it is highly recommended.

Python

For this article, I assume you already have Python installed on your machine along with an IDE of your choice. If not, many tutorials can be found online.

Pandas

Pandas is a data manipulation Python package. Pandas will be used to store the scraped data and export it to a CSV file. Pandas can be installed using pip by typing the following command into your terminal:

pip install pandas

Selenium

Selenium is a browser automation Python package. Selenium will be used to crawl the websites and extract data. Selenium can be installed using pip by typing the following command into your terminal:

pip install selenium

Geckodriver

For Selenium to automate a browser, it requires a driver. Because the TOR browser is built on Firefox, we will be using Mozilla's geckodriver. You can download the driver from the geckodriver releases page on GitHub. After downloading, extract the driver and move it to your ~/.local/bin folder.

Firefox Binary

The location of the TOR browser's Firefox binary will also be needed. To find it, right-click the TOR browser in your Applications folder and click Show Package Contents. Then navigate to the Firefox binary and copy the full path; on a typical macOS install it looks something like /Applications/Tor Browser.app/Contents/MacOS/firefox. Save this path somewhere for later use.

Implementation

Now that you have set up your environment, you are ready to start writing your scraper.

First, import webdriver and FirefoxBinary from Selenium, and import pandas as pd. Then point Selenium at the TOR browser's Firefox binary and load your target URL.

from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import pandas as pd

# Launch the TOR browser's Firefox binary under Selenium's control
binary = FirefoxBinary(*path to your firefox binary*)
driver = webdriver.Firefox(firefox_binary=binary)

# Load the hidden service
url = *your url*
driver.get(url)
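Note that FirefoxBinary is deprecated in newer Selenium releases. If you are running Selenium 4 or later, the equivalent setup points an Options object at the binary instead, roughly as sketched below.

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Point Selenium 4+ at the TOR browser's Firefox binary
options = Options()
options.binary_location = *path to your firefox binary*
driver = webdriver.Firefox(options=options)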

You can now scrape the hidden service like you would any website!

Basic Selenium Scraping Techniques

Whether you are new to Selenium or just need a refresher, you can use these basic techniques to scrape websites effectively. Additional Selenium scraping tutorials can be found on the internet.

Finding Elements

A key part of scraping with Selenium is locating HTML elements to collect the data. There are several ways you can do this in Selenium. One method is by using the class name. In order to find the class name of an element, you can right-click it and click inspect. Below is an example of finding an element by class name.

driver.find_element_by_class_name("postMain")

You can also find elements by their XPath. An XPath describes the location of the element within the HTML structure. You can copy an element's XPath from the right-click menu of the corresponding node in the inspect interface. Below is an example of finding an element by XPath.

driver.find_element_by_xpath('/html/body/div/div[2]/div[2]/div/div[1]/div/a[1]')

If you want to find multiple elements, you can use “find_elements” instead of “find_element”. Below is an example.

driver.find_elements_by_class_name("postMain")
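One caveat: the find_element_by_* helpers shown above come from older Selenium releases and were removed in Selenium 4. If your installed version is newer, the equivalent calls go through the By class, as sketched below.

from selenium.webdriver.common.by import By

# Selenium 4 equivalents of the examples above
driver.find_element(By.CLASS_NAME, "postMain")
driver.find_element(By.XPATH, '/html/body/div/div[2]/div[2]/div/div[1]/div/a[1]')
driver.find_elements(By.CLASS_NAME, "postMain")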

Getting the Text of an Element

You can retrieve the text of an element using its text attribute. Below is an example.

driver.find_element_by_class_name('postContent').text

Storing Elements

You can store scraped values by saving an element's text in a variable and then appending that variable to a list. Below is an example.

post_content_list = []
postText = driver.find_element_by_class_name('postContent').text
post_content_list.append(postText)
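If you want to collect the same field from every post on a page, you can combine find_elements with a loop. A short sketch, reusing the postContent class from the earlier examples, is below.

post_content_list = []
# Grab the text of every element with the postContent class on the page
for post in driver.find_elements_by_class_name('postContent'):
    post_content_list.append(post.text)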

Crawling Between Pages

Some page-based websites include the page number in the URL. You can loop over a range of page numbers and alter the URL to crawl multiple pages. An example is below.

for i in range(1, MAX_PAGE_NUM + 1):
    page_num = i
    url = '*first part of url*' + str(page_num) + '*last part of url*'
    driver.get(url)

Exporting to CSV File

After crawling a page and saving data into lists, you can export those lists as tabular data using Pandas. An example is below.

df = pd.DataFrame()  # empty DataFrame to hold the scraped fields
df['postURL'] = post_url_list
df['author'] = post_author_list
df['postTitle'] = post_title_list
df.to_csv('scrape.csv')

Anti-crawling Measures

Many hidden services employ anti-crawling measures to keep information secret and to avoid DDoS attacks. The most common measures you will encounter are captchas. While some captcha auto-solvers exist, oftentimes hidden services will use unique captcha types that the solvers cannot pass. Below is an example of a captcha found on a forum.

[Image: example captcha from a dark web forum]

If a captcha is only required at specific points (such as when first connecting to the server), you can use Selenium's implicit wait. An implicit wait tells the driver to keep polling for an element for up to a set number of seconds before giving up, which leaves you time to solve the captcha manually. Below is an example in which Selenium waits until it can find the element with the class name "postMain".

driver.implicitly_wait(10000)  # poll for up to 10,000 seconds when locating elements
driver.find_element_by_class_name("postMain")
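An alternative is Selenium's explicit waits, which let you wait on a specific condition for a single call rather than setting a driver-wide timeout. A sketch using WebDriverWait (with an assumed one-hour limit) is below.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to an hour, or until an element with class "postMain" appears
WebDriverWait(driver, 3600).until(EC.presence_of_element_located((By.CLASS_NAME, "postMain")))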

Other times, if the server identifies you as a robot, it will stop serving you. To get around this, scrape the website in chunks instead of all at once. You can save the data in different CSV files and then combine them with an additional Python script using Pandas' concat function. Below is an example.

import pandas as pd

df = pd.read_csv('scrape.csv')
df2 = pd.read_csv('scrape2.csv')
df3 = pd.read_csv('scrape3.csv')
df4 = pd.read_csv('scrape4.csv')
df5 = pd.read_csv('scrape5.csv')
df6 = pd.read_csv('scrape6.csv')

frames = [df, df2, df3, df4, df5, df6]
result = pd.concat(frames, ignore_index=True)
result.to_csv('ForumScrape.csv')
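If the number of chunk files is not fixed, one alternative (assuming the chunks all follow a scrape*.csv naming pattern) is to glob for them instead of listing each file by hand, as sketched below.

import glob
import pandas as pd

# Read every chunk file matching the pattern and stack them into one frame
frames = [pd.read_csv(path) for path in sorted(glob.glob('scrape*.csv'))]
result = pd.concat(frames, ignore_index=True)
result.to_csv('ForumScrape.csv')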

Discussion

Scraping the dark web poses unique challenges compared to scraping the surface web. However, it is relatively untapped and can provide excellent cybercrime intelligence. While hidden services often employ anti-crawling measures, these can be bypassed, and the data behind them can be interesting and useful.

I want to reiterate that scraping the dark web can be dangerous. Make sure you take the necessary safety precautions. Please continue to research safe browsing on the dark web. I am not responsible for any harm that occurs.
