Selenium, Tor and New Year's Resolutions

Published 22nd December 2019 at 06:37am UTC (Last Updated 14th April 2020 at 06:36pm UTC)

Every year I tell myself that I'll get serious about blogging. Every year, I start a post - sometimes two - crappy little drafts that I never get round to finishing. Something's always in the way. I think I'll aim for something shorter from now on. Shorter posts, more frequent.

I could talk about some of the awful (and in one case dangerous) people I had to deal with at my last job. That job isn't listed in my "about" page at the moment because I haven't had time to update the site.

Since I left RPS, I've been

a little

preoccupied.

Perhaps I'll discuss it all another time. Ideally at a more social hour (it's 3:09am).


 

Selenium


Did you know that you can drive the Tor Browser with Selenium commands? I found a really cool library earlier this week that lets you do it. The repo contains a few examples, so it's definitely worth a read through.

To try it out I'd suggest the following as prerequisites:


  • Python 3 (I mean, you could probably use a different language for this if you really wanted to but most of the examples I found online were already in Python - which I'm not even a fan of - *shrugs*)
  • Jupyter Notebook (just saves you having to run python fileName.py each time so not a requirement per se...it also doesn't even render in GitHub properly anymore...maybe you're better off leaving it out, idk)
  • Selenium (pip install selenium)
  • TB Selenium (pip install tbselenium)
  • The stem package (pip install stem)
  • Tor (duh!)


In your script, first add your imports:

import tbselenium.common as cm
from tbselenium.tbdriver import TorBrowserDriver
from tbselenium.utils import launch_tbb_tor_with_stem
from selenium.webdriver.common.keys import Keys
import time

Most of them are from the WebFP tor-browser-selenium library itself, but the Keys class is imported from the original Selenium WebDriver and of course time is a built-in Python module.

Once you've installed Tor, assign the path of its directory to the tbb_dir variable in your script...

tbb_dir = "/absolute/path/to/tor-browser_en-US/"

...this directory will be referred to in the following line. Here you will launch a new Tor process with the stem package:

tor_process = launch_tbb_tor_with_stem(tbb_path=tbb_dir)

The code that you want to run will need to take place within the following block:

with TorBrowserDriver(tbb_dir, tor_cfg=cm.USE_STEM) as driver:
    pass  # do selenium stuff here

Fun fact about Selenium: it's pronounced "se-leh-nium". I know this because one of its inventors told me this...or perhaps it was a group of us that he told. It was probably a whole tech talk. It was a long time ago. My point is, the English language is dumb.

Now load the page you want to visit. Where should we go?

How about the Hidden Wiki?

with TorBrowserDriver(tbb_dir, tor_cfg=cm.USE_STEM) as driver:
    driver.load_url("http://zqktlwi4i34kbat3.onion/wiki/index.php/Main_Page")

Note: make sure your browser's set to "Safest" in the security settings. Onion sites take forever and a day to load with the "Standard" settings for me. No idea why. Should probably care more but it's late (well, early) and I don't.

To fix this, open an instance of the Tor browser locally and change the settings manually like so:

[Screenshot: Tor Security Settings]

Then close the browser and in a terminal window, run:

sudo pkill tor

Running that command before your finished script should also resolve the error described in this GitHub issue (OSError: Process terminated: Failed to bind one of the listener ports.) if you run into it.
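If you'd rather check before reaching for pkill, here's a rough sketch that tests whether something is already listening on a port. The function name port_in_use is made up for this post, and the port numbers are an assumption on my part: 9250 and 9251 are, as far as I can tell, the SOCKS and control ports that tbselenium uses with stem, so change them if your setup differs.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


# Assumed ports (see above): adjust them if your Tor setup uses others.
if __name__ == "__main__":
    for port in (9250, 9251):
        state = "in use" if port_in_use(port) else "free"
        print(f"port {port}: {state}")
```

If either port shows up as in use before you've launched anything, a leftover Tor process is the likely culprit.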

Next, add any of the actions that you would like Selenium to perform into the block. It's late (or early...it's 4:45am now) so let's keep this to a simple search. We can look for the term "pgp" as I still know next to nothing about it:

el = driver.find_element_by_name("search")

In the code above, we're looking for an element on the web page with the name "search". This is actually not the best way to do it...in fact, as the field itself looks like this...


[Screenshot: Hidden Wiki search input source code]


...you'd have probably been better off with something like...

el = driver.find_element_by_xpath("//input[@id='searchInput']")

Either one works though, so whatever. The commands that follow in the full script clear the field as a precaution (el.clear()), type text into the search field (el.send_keys("pgp")) and hit the return key (el.send_keys(Keys.RETURN)).

Once you're done, always make sure you remember to kill the tor process:

tor_process.kill()

So your code should look like this:

import tbselenium.common as cm
from tbselenium.tbdriver import TorBrowserDriver
from tbselenium.utils import launch_tbb_tor_with_stem
from selenium.webdriver.common.keys import Keys
import time

tbb_dir = "/absolute/path/to/tor-browser_en-US/"
# Launch a Tor process with stem
tor_process = launch_tbb_tor_with_stem(tbb_path=tbb_dir)
with TorBrowserDriver(tbb_dir, tor_cfg=cm.USE_STEM) as driver:
    driver.load_url("http://zqktlwi4i34kbat3.onion/wiki/index.php/Main_Page")
    el = driver.find_element_by_name("search")
    time.sleep(1)
    el.clear()  # clear the field as a precaution
    el.send_keys("pgp")  # type the search term
    el.send_keys(Keys.RETURN)  # submit the search
    time.sleep(5)  # give the results page time to load

# Always kill the Tor process once you're done
tor_process.kill()
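One thing the script above doesn't handle: if anything inside the with block raises, tor_process.kill() never runs and you're back to pkill. Here's a minimal sketch of a context manager that kills the process no matter what. The name running_tor is my own invention, not part of tbselenium, and it assumes that whatever you pass as launch takes a tbb_path keyword and returns an object with a kill() method, like launch_tbb_tor_with_stem does in the script above.

```python
from contextlib import contextmanager


@contextmanager
def running_tor(launch, tbb_dir):
    """Start Tor with `launch` and kill it on exit, even after an error.

    `launch` is assumed to behave like launch_tbb_tor_with_stem: it takes a
    tbb_path keyword argument and returns an object with a kill() method.
    """
    tor_process = launch(tbb_path=tbb_dir)
    try:
        yield tor_process
    finally:
        # Runs whether the body finished normally or raised
        tor_process.kill()


# Intended usage with the names from the script above:
#
# with running_tor(launch_tbb_tor_with_stem, tbb_dir):
#     with TorBrowserDriver(tbb_dir, tor_cfg=cm.USE_STEM) as driver:
#         driver.load_url("...")
```

Passing the launcher in as an argument is just a convenience so the helper can be tried out without Tor installed.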

There are other things you can do to improve the script. Running a headless Tor browser's good for when you want it to run in the background. To do that, first install xvfb. On Manjaro, I had to run:

sudo pacman -S community/python-xvfbwrapper
sudo pacman -S extra/xorg-xdpyinfo

Don't know what you'd have to do on anything else sadly.

Then import the xvfb helpers from the tbselenium package:

from tbselenium.utils import start_xvfb, stop_xvfb, launch_tbb_tor_with_stem

And wrap your code in a block like the following:

xvfb_display = start_xvfb()
with TorBrowserDriver(tbb_dir, tor_cfg=cm.USE_STEM) as driver:
    pass  # do selenium stuff here
stop_xvfb(xvfb_display)
tor_process.kill()

You can also use the WebDriverWait class to wait for the page to load a specific element before running a command. In a different script, I thought it made sense to create a helper method like this one:

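from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def wait_for_it(driver, xpath):
    # Wait up to 10 seconds for the element to become clickable;
    # until() returns the element once the condition is met.
    return WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, xpath))
    )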

Here driver is the driver from before and xpath would be a string like "//input[@id='searchInput']". This saves us using time.sleep() all the time, which is a little tedious if the page takes 10 seconds to load some of the time and 3 seconds most of the time, &c.
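If you're wondering why this beats a fixed sleep, the idea is just polling: keep checking a condition and carry on the moment it's true, giving up after a deadline. Below is a plain-Python sketch of that idea. To be clear, this is not Selenium's actual implementation, and wait_until is a name I made up for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        # Sleep briefly between checks instead of one long fixed sleep
        time.sleep(poll_interval)
```

With this, a page that's ready in 3 seconds costs you roughly 3 seconds, and one that takes 10 only fails if it blows past the timeout.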


 

Other news:


  • I've started using Manjaro which, from what I've heard, is like Baby Arch. I might post if I stumble upon something interesting but idk.

[Image: Baby Linux]

  • To all of my fellow countrymen who voted the wrong way on the 12th and those who helped spread vicious lies about a good man prior to that: when the UK slips into third-world levels of poverty because the man that you elected sold off or drove out everything and everyone that made this country worth a damn, I hope that you can live with knowing that your consent made it all possible.

 

Ok, signing off for the night (well, morning) as it's 6:37 and I was meant to go to sleep over four hours ago.