RPA Challenge (Shortest Path) with Python, Selenium & Tesseract

A few weeks ago I found and tried the RPA challenge. It was fun indeed. A few new challenges have been released lately, and I'm taking a closer look at one of them: the shortest path. It seems to be even more interesting!

So what's the task? The authors want challengers to build an attended bot, but I decided to do it in unattended mode as it is simply doable. The goal of the exercise, in short, is to match pairs of balloons on the map, pairing each red balloon with the closest green one. Then you read the data shown below the map, fill in the form with it and submit. The difficulty here is that the data table rows come in random order, and most of the row headers are images containing text and noise. It is not that easy to read them and match each one with the corresponding field on the form.

I did it in two different ways: with and without OCR (Optical Character Recognition). The one without OCR is much faster (~7-8 seconds for the entire exercise); with Tesseract OCR it takes ~40 seconds. The final time is not that important, as both methods require a few tricks to make them work well, and that is what's much more interesting here. Let me share the details.

Selecting right points on a map

Let me first describe how we can make the bot select points (balloons) on the map in unattended mode. The aim is to pair each red balloon (demand) with the closest green one (supply). For a human operator that's no problem unless the points are too close to each other. The bot, however, needs to do some calculation.

One possible way to do it is to locate the image position on the screen and then calculate the distances. However, not all balloons are visible without scrolling the map. Let's see what the code behind them looks like.

The highlighted CSS function (translate3d) moves the object in 3-dimensional space; the pixel numbers are the repositioning vector coordinates (x, y, z). The 3rd dimension is not used (always 0px). Since we now have the x, y coordinates, we can calculate the distances between the balloons using the Pythagorean theorem and pick the shortest one.

(x1, y1) and (x2, y2) are the coordinates of balloons 1 and 2 respectively, and c is the distance we're looking for. a and b are the two remaining sides of the right triangle (with c as hypotenuse): a = (x2 - x1), b = (y2 - y1), therefore c = sqrt((x2 - x1)^2 + (y2 - y1)^2).

Here's the Python code for that. The function takes the red balloon we want to pair and the list of all green balloons as arguments. It calculates the distances between the red balloon and all green balloons, chooses the smallest value and returns the corresponding green balloon object.

# calculate the closest balloon (to 'balloon') from the targets list
def findClosestBalloon(balloon, targets):
    closest_distance = float("inf")
    closest_target = None
    balloon_coords = getBalloonCoord(balloon)
    for target in targets:
        target_coords = getBalloonCoord(target)
        a = int(balloon_coords[0]) - int(target_coords[0])
        b = int(balloon_coords[1]) - int(target_coords[1])
        c = (a * a + b * b) ** 0.5
        if c < closest_distance:
            closest_target = target
            closest_distance = c
    return closest_target

# get red/green balloon (x, y) coordinates from its style attribute
def getBalloonCoord(balloon):
    coords = None
    for style in balloon['style'].split(";"):
        style = style.split(":")
        if style[0].strip() == "transform":
            coords = style[1].strip().replace("translate3d(", "").replace(")", "").replace("px", "").split(",")[0:2]
    return coords
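As a quick sanity check, the matching logic can be exercised on dictionary stand-ins for the scraped elements. This is a minimal sketch: in the real script the balloons are parsed page elements, and the coordinates below are made up.

```python
import math

# Dictionary stand-ins for the scraped balloon elements (assumption:
# the real ones expose the same 'style' attribute)
red = {"style": "transform: translate3d(100px, 100px, 0px)"}
greens = [
    {"style": "transform: translate3d(400px, 100px, 0px)"},  # 300 px away
    {"style": "transform: translate3d(130px, 140px, 0px)"},  # 50 px away
]

def get_coords(balloon):
    # same parsing as getBalloonCoord: keep the first two translate3d arguments
    for declaration in balloon["style"].split(";"):
        name, _, value = declaration.partition(":")
        if name.strip() == "transform":
            xy = value.strip().replace("translate3d(", "").replace(")", "").replace("px", "").split(",")[0:2]
            return [int(v) for v in xy]

def find_closest(balloon, targets):
    # pick the target with the smallest Euclidean distance
    bx, by = get_coords(balloon)
    return min(targets, key=lambda t: math.hypot(get_coords(t)[0] - bx,
                                                 get_coords(t)[1] - by))

print(find_closest(red, greens) is greens[1])  # True
```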

Reading text from images

So what's the challenge here? Most of the details' headers are noisy images containing text. To properly find out which detail a row holds, we need to read the text from the image. But is this really necessary?

You have to be clever!

We have everything we need to correctly capture, or rather make an educated guess about, what is represented in each table row. How? Through data analysis.

  • Ship preference – only three possible values: Enclosed, Flatbed or SteepDeck
  • Cargo preference – only two values: Urgent or Permit required ('Premit Required', actually, due to a typo in the challenge)
  • State – 2 characters, non-numerical
  • Zip Code – 5 characters, the last one always numerical
  • Demand date – 10 characters, the 3rd and 6th being "-", the rest numerical
  • Address 2 – can be found in the balloon popup

That leaves Address 1, City and Cargo with no clear rules we can use. What to do now? Assuming the noise on the images is random, and knowing the texts differ in character length, we can look at the average color of each image (presumably 'Address 1' will be darker than 'City'). Another handy detail is that the images are embedded in the webpage as base64-encoded data.

from PIL import Image
import numpy as np
from io import BytesIO
import base64

# td_img is the scraped <img> element of the row header
img_base64 = td_img['src'].replace("data:image/png;base64,", "")
# decode the base64 data and open it as an image
img = Image.open(BytesIO(base64.b64decode(img_base64)))
# calculate the mean pixel value of the image
img_mean = np.mean(img)
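The intuition behind the mean can be checked on synthetic data. This is only an illustrative sketch: it assumes the headers are RGBA images on a fully transparent background, so most pixel values are 0 and a longer label means more opaque (non-zero) pixels, hence a higher overall mean.

```python
import numpy as np

# Two fake 20x100 RGBA header images on a fully transparent background
city = np.zeros((20, 100, 4), dtype=np.uint8)
city[5:15, 10:30, 3] = 255       # short label -> few opaque pixels
address1 = np.zeros((20, 100, 4), dtype=np.uint8)
address1[5:15, 10:70, 3] = 255   # longer label -> more opaque pixels

# more "ink" pulls the overall mean up
print(np.mean(city) < np.mean(address1))  # True
```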

The results of this exercise are promising: the 'City' image mean falls in the range (1.2, 1.75), 'Cargo' in (1.75, 2.17) and 'Address 1' in (2.17, 3). It's important to mention that the 'Address 2' image mean is very similar to 'Address 1', so we need to exclude it separately. This gives us a complete set of rules to assign our variables.

text = tds[1].text
if len(text) == 2 and not text.isnumeric():
    demand_state = text
elif text in ("Enclosed", "Flatbed", "SteepDeck"):
    demand_ship_preference = text
elif text in ("Premit Required", "Urgent"):
    demand_cargo_preference = text
elif len(text) == 10 and text[2:3] == "-" and text[5:6] == "-":
    demand_date = text
elif len(text) == 5 and text[-1:].isnumeric():
    demand_zip = text
elif 1.2 < img_mean < 1.75:
    demand_city = text
elif 1.75 < img_mean < 2.17:
    demand_cargo = text
elif 2.17 < img_mean < 3 and text != demand_address2:
    demand_address1 = text

We can treat the supply data similarly.
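The text-only rules can also be packaged into a small helper reusable for both the demand and supply tables. This is just a sketch: the function name is my own, and the image-mean branches are left out because they need the decoded image.

```python
def classify_text(value):
    # classify a table cell by the text-only rules derived above
    if len(value) == 2 and not value.isnumeric():
        return "state"
    if value in ("Enclosed", "Flatbed", "SteepDeck"):
        return "ship_preference"
    if value in ("Premit Required", "Urgent"):
        return "cargo_preference"
    if len(value) == 10 and value[2] == "-" and value[5] == "-":
        return "date"
    if len(value) == 5 and value[-1].isnumeric():
        return "zip"
    return None  # ambiguous: fall back to the image-mean heuristic

print(classify_text("TX"))          # state
print(classify_text("08-15-2021"))  # date
```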

Tesseract OCR

The second approach is to read the image text with OCR. I used Tesseract, first of all because it is free, and secondly because it is considered the best among free OCR engines. As I already mentioned, the images contain plenty of random noise apart from the text, so the OCR didn't initially return good results. What can we do about it? Denoise!

import cv2
import numpy as np
from PIL import Image

img = np.array(img)  # convert the PIL image to a numpy array
alpha = img[:, :, 3]  # extract the alpha channel
img = ~alpha  # invert black/white
_, blackAndWhite = cv2.threshold(img, 140, 255, cv2.THRESH_BINARY_INV)  # apply threshold with cv2
img = cv2.bitwise_not(blackAndWhite)  # invert again

The results are not stunning, but they are enough to improve Tesseract's hit rate.

top: original, bottom: denoised

However, this is still not enough. Some results come out distorted: 'Slate' or '.Stale' instead of 'State'. There is a way to measure the difference between words numerically: the Levenshtein distance (or edit distance), which we can use easily in Python. Here's the full code for the text recognition.

from nltk.metrics import edit_distance
import string
import cv2
import numpy as np
import pytesseract
from PIL import Image

word_dict = ['State', 'Cargo', 'Ship preference', 'Zip Code', 'City',
             'Address 1', 'Address 2', 'Cargo preference', 'Shipping date']

def findClosestPhrase(img, word_dict):
    img = np.array(img)
    alpha = img[:, :, 3]  # extract the alpha channel
    img = ~alpha  # invert black/white
    _, blackAndWhite = cv2.threshold(img, 140, 255, cv2.THRESH_BINARY_INV)
    img = cv2.bitwise_not(blackAndWhite)
    img_text = pytesseract.image_to_string(img, lang='eng')  # read text from the image
    img_text = img_text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    words_distance = [edit_distance(img_text, x) for x in word_dict]  # Levenshtein distances
    return word_dict[words_distance.index(min(words_distance))]  # closest dictionary entry
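To see the edit distance at work without pulling in OCR, here is a self-contained Levenshtein implementation (an illustrative sketch, equivalent to nltk's edit_distance with default costs) matched against a typical noisy read.

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

word_dict = ['State', 'Cargo', 'Ship preference', 'Zip Code', 'City',
             'Address 1', 'Address 2', 'Cargo preference', 'Shipping date']

# '.Stale' - a typical distorted OCR read - still maps back to 'State'
print(min(word_dict, key=lambda w: levenshtein('.Stale', w)))  # State
```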

Methods comparison

A good thing about the RPA challenge is that you can do it as many times as you want, and it shows you your rating. Having scripts for both methods, I checked how accurate they are. Each script performed the exercise 100 times (updating 8,000 fields in total), and the results are as follows.

The method without OCR misidentified 204 out of 8,000 updated fields (97.45% accuracy). Tesseract, supported by denoising and Levenshtein distance, missed only 16 out of 8,000 fields (99.8% accuracy). Both results are pretty good, but I believe there is still plenty to improve, and 100% is within reach.

Updating form fields

Once I have all the contract data assigned to variables, I can inject a JS script into the page to update the fields. I won't describe the method here; you can find it in the previous article.

Video tutorial

You can find the video version here.

Of course, this challenge is not a real-life scenario, but it includes elements you may encounter on your RPA journey. I encourage you to give this and the previous challenge a try!

