screen scraping : Java Glossary


screen scraping


Web-Based Screen Scraping

Screenscraping (now more often called webscraping) also refers to extracting information from HTML (Hypertext Markup Language) pages on the web. Unless the authors permit reuse, you may be violating copyright by doing that. I got in trouble by screenscraping foreign exchange rates off the Oanda site. Even material that looks fair game for reuse, e.g. prices, is not necessarily so. It is a legal minefield. It seems that manually extracting information is considered less sinful than using a program to do it, but you can still get in trouble.

Even if you screen-scrape for non-commercial purposes, even if you don’t repost the data and even if you don’t put much load on their server, they can still get irate, block you and send lawyer letters. I think the main reason is that they put up their site primarily to serve ads (the data offered are just bait), and you obviously are not reading the ads if you are screenscraping. If you stick to government sources, you will likely be safe.

Before you launch into a screen scraping project, do a thorough search of the site for a downloadable version of the data, sometimes in CSV (Comma-Separated Value) format, spreadsheet format or SOAP/XML (eXtensible Markup Language). This is ever so much more convenient and stable, not to mention quick. You download just the information you need, not a ton of copy to induce sales. If there is no such download, it never hurts to ask the source to provide one. It somehow never occurs to data providers that data are almost useless unless they are in a computer-friendly format, i.e. not HTML.

Once you have your downloaded page, String.indexOf and regexes are useful tools to extract the data. Usually the HTML is too malformed to feed to a straightforward HTML parser. TagSoup can be useful to tidy up mangled HTML syntax before simple-minded programs sift through the data.
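As a rough illustration of combining String.indexOf with a small regex, here is a minimal sketch. The "Price:" label, the markup in the sample page and the class name PriceExtractor are all made up for the example; a real site will need its own landmarks.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: pull one field out of a downloaded page with indexOf, then a small regex. */
class PriceExtractor {
    // Hypothetical landmark: the price is assumed to follow a "Price:" label.
    private static final String LABEL = "Price:";
    // A dollar sign, optional spaces, then an amount such as 12.95.
    private static final Pattern PRICE = Pattern.compile("\\$\\s*(\\d+\\.\\d{2})");

    /** Returns the first dollar amount found shortly after the label, if any. */
    static Optional<String> extractPrice(String html) {
        int start = html.indexOf(LABEL);
        if (start < 0) {
            return Optional.empty(); // label missing: layout changed or no such product
        }
        // Search only a short window after the label so an unrelated price
        // elsewhere on the page is not picked up by mistake.
        int end = Math.min(html.length(), start + 200);
        Matcher m = PRICE.matcher(html.substring(start, end));
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String page = "<tr><td>Price:</td><td> $12.95 </td></tr>";
        System.out.println(extractPrice(page).orElse("not found")); // prints 12.95
    }
}
```

indexOf does the coarse navigation and the regex stays tiny, which keeps it easy to adjust when the site changes.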

Tips

Keep your regexes simple. Don’t try to extract more than a couple of fields per regex. The problem is that websites are almost never consistent and you will have to keep adjusting your regexes to handle more and more exceptions. If you try to collect too much information per regex, the regex becomes too complicated and fragile.
Don’t try to do everything with regex. Use ordinary Java string handling to polish the results.
Write code to collect the website pages and store them locally. Then work on writing code to analyse them. This approach saves the overhead of repeatedly collecting the web pages and it allows you to use global search and replace tools to discover the patterns in the pages. Further, you are not going to annoy their servers with repeated hits for the same data.
Often there are several sources of the same information. Look them over. See which are the most accurate, complete, consistent, maintained and easy to parse before you decide which one to use.
Screenscraping is very frustrating. You have everything working, then the vendor will change the format, introduce ads for related products, find another 5 ways to say that there is no such product, or decide to totally obfuscate everything with JavaScript. The vendor may decide that actually using the data is cheating and block your access. Screen scraping programs require constant maintenance and can stop working at any point. At any point you may have to throw in the towel. If there is any other way to get the data, it is better than screenscraping.
I use a clues-based approach to screenscraping bookstores. For each bookstore I have a list of strings and regexes to look for in the result and positive or negative points associated with each string. The presence of some strings hints the book is in stock. Some hint it is out of stock. Some hint the book has been replaced by a newer edition etc. Kobo requires 12 strings, Powells requires 17 strings, Barnes & Noble requires 27 strings, Chapters Indigo requires 32 strings. And further, the strings change every month or so. People like variety. Computers hate it. I have asked about 30 bookstores to insert a consistent icon or piece of text to indicate whether the book is in stock, both to make it easy for people and computers, but all refused. They don’t want to make it easy. They want customers to linger. They are suspicious of screenscrapers, even when screenscrapers are affiliates trying to send them business.
JavaScript is a royal pain in the ass. It is as if its primary purpose is to foil screen scraping. There are two different pages:
- The raw page from the server which is what your screenscraper will see.
- What you can inspect on the browser screen after JavaScript has decoded the page.
Searching for strings that JavaScript generates will get you nowhere. Instead of looking for the generated strings, you have to look for the raw data JavaScript uses, e.g. error message numbers. In theory, there should be some way to run JavaScript outside the browser on the page so your screen scraper too can see the decoded version. (See HtmlUnit.)
For screenscraping sites in languages you do not speak, using Google Chrome with automatic translate is invaluable. Pretty quickly you learn the key vocabulary.
Websites often use JavaScript to obfuscate what the web page is doing and to foil webscraping. Instead of trying to unravel the JavaScript, just monitor the HTTP (Hypertext Transfer Protocol) traffic with Wireshark and emulate those transactions using the CMP (Canadian Mind Products) HTTP package.
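The clues-based approach from the bookstore tip above can be sketched roughly as follows. The class name StockClues, the particular regexes and the point values are invented for illustration; they are not the actual clue lists used for any store.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch: score a stored page against weighted clues to guess whether a book is in stock. */
class StockClues {
    /** One clue: a regex plus the points awarded when it appears in the page. */
    private static final class Clue {
        final Pattern pattern;
        final int points;

        Clue(String regex, int points) {
            this.pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
            this.points = points;
        }
    }

    private final List<Clue> clues = new ArrayList<>();

    StockClues add(String regex, int points) {
        clues.add(new Clue(regex, points));
        return this;
    }

    /** Sums the points of every clue found; positive suggests in stock, negative suggests not. */
    int score(String page) {
        int total = 0;
        for (Clue clue : clues) {
            if (clue.pattern.matcher(page).find()) {
                total += clue.points;
            }
        }
        return total;
    }

    /** Hypothetical clue list for one imaginary store. */
    static StockClues exampleStore() {
        return new StockClues()
                .add("\\bin stock\\b", 10)
                .add("\\bnot in stock\\b", -20) // outweighs the "in stock" it contains
                .add("\\bout of stock\\b", -10)
                .add("\\bnewer edition\\b", -5);
    }
}
```

Keeping the clues as data rather than code means that when a store rewords its pages, only the list needs editing.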

Historical Screen Scraping

In the olden days, screen scraping meant running scripted client software that interacted with legacy green-screen applications, e.g. CICS 3270 terminal apps, and (through the scripting) could return data to a host component. The host component could make the data available to non-legacy apps through ODBC (Open Data Base Connectivity), JDBC (Java Data Base Connectivity), etc.

The screen scraper program has to fool the host into thinking it is talking to one of its usual hardware terminals with an operator sitting at it. It must compose queries in the format the usual hardware would produce and interpret the formatted data coming back, parsing it to extract the data and leave behind the formatting.

However, today screen scraping is much simpler. You have to emulate a browser and the server sends you HTML.
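As a minimal sketch of emulating a browser, the standard java.net.http client (Java 11 and later) can send a request carrying browser-like headers. The class name BrowserLikeFetch and the header values are assumptions for illustration; in practice you would copy the headers a real browser sends to the site in question.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Sketch: fetch a page while presenting browser-like request headers. */
class BrowserLikeFetch {
    // Assumed value for the example; substitute what a real browser sends.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleScraper/1.0)";

    /** Builds a GET request with headers resembling those of a browser. */
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .header("Accept", "text/html")
                .GET()
                .build();
    }

    /** Sends the request and returns the raw HTML exactly as the server delivered it. */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

What comes back is the raw page from the server, before any JavaScript has run.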

Before you leap into writing an old-tyme screen scraper, investigate thoroughly all the possible terminals you might emulate that will work with the existing app. You might find some simpler to emulate than others.

Some of the old terminals had quite complex protocols, e.g. SDLC (Synchronous Data Link Control), so you usually don’t want to write that part from scratch. Look for a third party library to handle the low-level protocol details.

Screen scraping can also refer to capturing a bit image off the screen the program is running on using Robot.createScreenCapture.
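A minimal sketch of grabbing a screen region with Robot.createScreenCapture follows. The class name ScreenGrab is made up, and the sketch assumes it is acceptable to return nothing when no display is available (for example on a headless server).

```java
import java.awt.AWTException;
import java.awt.GraphicsEnvironment;
import java.awt.Rectangle;
import java.awt.Robot;
import java.awt.image.BufferedImage;
import java.util.Optional;

/** Sketch: capture a rectangle of the screen as a BufferedImage. */
class ScreenGrab {
    /** Returns the captured pixels, or empty when there is no screen to capture. */
    static Optional<BufferedImage> capture(Rectangle region) {
        if (GraphicsEnvironment.isHeadless()) {
            return Optional.empty(); // no display attached, so Robot cannot be used
        }
        try {
            return Optional.of(new Robot().createScreenCapture(region));
        } catch (AWTException e) {
            return Optional.empty(); // platform does not allow low-level input control
        }
    }

    public static void main(String[] args) {
        Optional<BufferedImage> shot = capture(new Rectangle(0, 0, 200, 100));
        System.out.println(shot.isPresent() ? "captured" : "no display available");
    }
}
```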

To convert the pixels back to text is not quite as difficult as you might think. You can do a primitive OCR (Optical Character Recognition) that just compares clip regions with a cast of prototype characters set in the same font and size, looking for an exact match. You might want to adjust colours to pure black and white before you start. This is quite a bit easier than real OCR, where you have to deal with imprecisely formed characters.
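Here is a rough sketch of the two steps just described: reducing a clip region to pure black and white, then looking for an exact match among prototype bitmaps. The class name ExactMatchOcr, the brightness threshold of 128 and the '?' result for no match are choices made for the example.

```java
import java.awt.image.BufferedImage;
import java.util.Arrays;
import java.util.Map;

/** Sketch: primitive OCR by exact comparison with prototype character bitmaps. */
class ExactMatchOcr {
    /** Reduces a clip region to pure black and white; true means ink (a dark pixel). */
    static boolean[][] binarize(BufferedImage img) {
        boolean[][] bits = new boolean[img.getHeight()][img.getWidth()];
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Assumed threshold: anything darker than mid-grey counts as ink.
                bits[y][x] = (r + g + b) / 3 < 128;
            }
        }
        return bits;
    }

    /** Returns the character whose prototype matches bit for bit, or '?' if none does. */
    static char recognize(boolean[][] glyph, Map<Character, boolean[][]> prototypes) {
        for (Map.Entry<Character, boolean[][]> entry : prototypes.entrySet()) {
            if (Arrays.deepEquals(glyph, entry.getValue())) {
                return entry.getKey();
            }
        }
        return '?';
    }
}
```

The prototypes would be built once by rendering each character in the same font and size and passing the result through binarize. Exact matching only works if the screen text is rendered without anti-aliasing or if the threshold removes it consistently.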

To separate characters you have to look for a vertical strip of white. To rapidly find the matching character you could use several methods:

You might normalise your comparisons by creating a tight bounding box around each character. If the bounding box sizes don’t match exactly, there is no point in comparing the bit contents. Presort the prototype letters by bounding box size, chain them off that size, and compare only against potential matches.
Do a hash on the bits of the character and use a HashMap lookup.
Create a set of probe points on the character and see if you can arrange the probe points such that every character has a unique pattern using a minimum number of probe points. Just look up the pattern (bits considered as int) in an array to get the corresponding character. To get your set, calculate for each probe point how many letters it can discriminate. Take the #1 discriminator as your first probe point. Then calculate how many letters each of the remaining probe points can discriminate that the #1 point can’t. The best of these is your #2 point, etc. You can then tweak the selection manually at each stage.
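The probe-point method in the last item can be sketched as a greedy selection. This is one possible reading of the description, not a definitive implementation: it assumes all prototype bitmaps have been padded to the same size, it measures a probe point's discriminating power as the number of still-confused pairs of characters it separates, and the class name ProbePointOcr is invented.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: greedily choose probe pixels so every glyph gets a unique bit pattern. */
class ProbePointOcr {
    final List<int[]> probes = new ArrayList<>(); // each entry is {x, y}
    final char[] table; // index = probe pattern as int; '\0' means unknown pattern

    /** glyphs[i] is the bitmap for chars[i]; all bitmaps share one size; true = ink. */
    ProbePointOcr(char[] chars, boolean[][][] glyphs) {
        int h = glyphs[0].length;
        int w = glyphs[0][0].length;
        // pattern[i] holds glyph i's bits at the probes chosen so far.
        // Glyphs with equal patterns are still confused with each other.
        int[] pattern = new int[chars.length];
        while (probes.size() < 24) { // cap keeps the lookup table a sane size
            int bestX = -1, bestY = -1, bestPairs = 0;
            for (int y = 0; y < h; y++) {
                for (int x = 0; x < w; x++) {
                    int pairs = 0; // confused pairs this pixel would separate
                    for (int i = 0; i < chars.length; i++) {
                        for (int j = i + 1; j < chars.length; j++) {
                            if (pattern[i] == pattern[j] && glyphs[i][y][x] != glyphs[j][y][x]) {
                                pairs++;
                            }
                        }
                    }
                    if (pairs > bestPairs) {
                        bestPairs = pairs;
                        bestX = x;
                        bestY = y;
                    }
                }
            }
            if (bestPairs == 0) {
                break; // all glyphs separated, or the remaining ones are identical
            }
            probes.add(new int[] {bestX, bestY});
            for (int i = 0; i < chars.length; i++) {
                pattern[i] = pattern[i] * 2 + (glyphs[i][bestY][bestX] ? 1 : 0);
            }
        }
        table = new char[1 << probes.size()];
        for (int i = 0; i < chars.length; i++) {
            table[pattern[i]] = chars[i];
        }
    }

    /** Samples the probe pixels and looks the resulting int up in the table. */
    char recognize(boolean[][] glyph) {
        int p = 0;
        for (int[] probe : probes) {
            p = p * 2 + (glyph[probe[1]][probe[0]] ? 1 : 0);
        }
        return table[p];
    }
}
```

Recognition then costs only a handful of pixel reads and one array lookup per character. Because only the probe pixels are examined, a glyph that is not in the prototype set can be misread as one that is, so a full comparison against the matched prototype is a sensible final check.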

CGI
HTML4
HtmlUnit: browser automation, including JavaScript
HTTP
HTTP read/download package: useful core to screenscraping app
parser
regex
schema.org
TagSoup
Wireshark

	This page is posted on the web at:	http://mindprod.com/jgloss/screenscraping.html