Thursday, 07 May 2020 04:10

How To Scrape the Dark Web

Author: Mitchell Telatnik [Source: This article was published in towardsdatascience.com]

Scraping the Dark Web using Python, Selenium, and TOR on Mac OSX

Warning: Accessing the dark web can be dangerous! Please continue at your own risk and take necessary security precautions such as disabling scripts and using a VPN service.

Introduction

Finding Hidden Services

Method 1: Directories

Method 2: Snowball Sampling
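Snowball sampling means seeding the crawl with a handful of known onion addresses and harvesting new ones from the links on each page visited. As a minimal sketch (this helper is not from the article; the regex and function name are assumptions), onion hosts can be pulled out of a page's HTML with a regular expression:

```python
import re

# Hypothetical helper: extract .onion hosts from raw HTML so they
# can seed the next round of crawling. Onion addresses use base32
# characters (a-z, 2-7) and are 16 chars (v2) or 56 chars (v3).
ONION_RE = re.compile(r'[a-z2-7]{16,56}\.onion')

def extract_onion_links(html):
    """Return the unique .onion hosts mentioned in a blob of HTML."""
    return sorted(set(ONION_RE.findall(html)))

sample = '<a href="http://exampleonion2345.onion/forum">forum</a>'
print(extract_onion_links(sample))  # → ['exampleonion2345.onion']
```

In practice the HTML would come from `driver.page_source`, and each newly discovered host would be appended to the crawl queue.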

Environment Setup

TOR Browser

VPN

Python

Pandas

pip install pandas

Selenium

pip install selenium

Geckodriver

Firefox Binary

Implementation

from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import pandas as pd

# Point Selenium at the Tor Browser's bundled Firefox binary
binary = FirefoxBinary('*path to your firefox binary*')
driver = webdriver.Firefox(firefox_binary=binary)

# Load the hidden service you want to scrape
url = '*your url*'
driver.get(url)

Basic Selenium Scraping Techniques

Finding Elements

# Find a single element by its class name
driver.find_element_by_class_name("postMain")

# Find a single element by its full XPath
driver.find_element_by_xpath('/html/body/div/div[2]/div[2]/div/div[1]/div/a[1]')

# Find all matching elements (returns a list)
driver.find_elements_by_class_name("postMain")

Getting the Text of an Element

driver.find_element_by_class_name('postContent').text

Storing Elements

# Append the text of each scraped element to a list
post_content_list = []
postText = driver.find_element_by_class_name('postContent').text
post_content_list.append(postText)

Crawling Between Pages

# Crawl numbered pages by rebuilding the URL for each page
for i in range(1, MAX_PAGE_NUM + 1):
    url = '*first part of url*' + str(i) + '*last part of url*'
    driver.get(url)

Exporting to CSV File

# Assemble the collected lists into a DataFrame and export it
df = pd.DataFrame()
df['postURL'] = post_url_list
df['author'] = post_author_list
df['postTitle'] = post_title_list
df.to_csv('scrape.csv')
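The export snippet assumes the lists were already filled during scraping. As a self-contained sanity check (the dummy values and filename below are illustrative, not from the article), the same pattern can be run with stand-in data and read back to confirm the CSV round-trips cleanly:

```python
import pandas as pd

# Dummy data standing in for the lists collected during scraping
post_url_list = ['http://example.onion/post/1', 'http://example.onion/post/2']
post_author_list = ['alice', 'bob']
post_title_list = ['First post', 'Second post']

# Build the DataFrame column by column, as in the article
df = pd.DataFrame()
df['postURL'] = post_url_list
df['author'] = post_author_list
df['postTitle'] = post_title_list
df.to_csv('scrape_demo.csv', index=False)

# Read it back to verify the export
check = pd.read_csv('scrape_demo.csv')
print(len(check))  # → 2
```

Passing `index=False` avoids writing pandas' row index as an extra unnamed column, which otherwise reappears when the file is read back.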

Anti-crawling Measures

[Image: an example captcha encountered while scraping]

# Wait (up to 10,000 seconds) for the element to appear, e.g. while
# a captcha is solved manually before scraping continues
driver.implicitly_wait(10000)
driver.find_element_by_class_name("postMain")
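Beyond waiting out captchas, a common precaution against rate-based blocking (an addition here, not a step from the article) is to randomize the delay between page loads so requests don't arrive on a fixed beat:

```python
import random
import time

def polite_sleep(low=2.0, high=6.0):
    """Sleep for a random interval between low and high seconds."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay

# Example: call between driver.get() invocations in the crawl loop
# polite_sleep()
```

The 2-6 second default is an arbitrary illustration; suitable bounds depend on the site being scraped.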
# Combine the CSV files from each scraping session into one file
import pandas as pd

df = pd.read_csv('scrape.csv')
df2 = pd.read_csv('scrape2.csv')
df3 = pd.read_csv('scrape3.csv')
df4 = pd.read_csv('scrape4.csv')
df5 = pd.read_csv('scrape5.csv')
df6 = pd.read_csv('scrape6.csv')

frames = [df, df2, df3, df4, df5, df6]
result = pd.concat(frames, ignore_index=True)
result.to_csv('ForumScrape.csv')

Discussion

