PowerHome General
 PowerHome Messageboard : PowerHome General
Subject Topic: URL scraper Post ReplyPost New Topic
Author
Message << Prev Topic | Next Topic >>
raven77 (Groupie - Joined: January 02 2007 - United States - Posts: 44)
Posted: August 08 2011 at 09:58

Can someone give me a quick explanation of how to implement this? I would like to use this new feature to pull temperature data from the CAI Networks WebControl.
Thanks!
dhoward (Admin Group - Joined: June 29 2001 - United States - Posts: 4447)
Posted: August 09 2011 at 23:05

Raven,

I had totally forgotten that I had included the URL Scraper plugin with the beta and meant to document it sooner. In the meantime, here are some quick basics to get going:

In the PowerHome Explorer under PowerHome|Setup|Plugins, create a new plugin line. Give it a suitable ID. Under "Launch Data (ActiveX classname)", use:

PH_URLScraper.phurlscraper

For "Initialization Data", enter the path and filename for the urlscraper.ini file. This will typically be:

c:\powerhome\plugins\urlscraper.ini
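
So, for example, the complete plugin line might end up looking something like this (the "URLSCRAPER" ID is just an example, use whatever ID suits you):

Code:
ID:                               URLSCRAPER
Launch Data (ActiveX classname):  PH_URLScraper.phurlscraper
Initialization Data:              c:\powerhome\plugins\urlscraper.ini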

Edit the urlscraper.ini file for the URL and data you're trying to get. The default file is currently set to scrape weather data for ZIP code 32712. The regular expression syntax is the standard VBScript flavor, which is best documented here: http://www.regular-expressions.info/vbscript.html.
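
Just to illustrate the syntax (the HTML here is made up, not from the actual weather page), a search with a single capturing group might look like:

Code:
regexsearch=<span class="temp">([-0-9.]+)</span>

The part in parentheses is what gets captured and passed along; everything else just has to match the surrounding HTML.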

When the plugin scans the URL and finds a regex match, it will fire the appropriate trigger. The regex search for section [URL_1_1] will fire the generic plugin trigger for the plugin ID with Command 1, Option 1. [URL_1_2] will be Command 1, Option 2. [URL_2_3] will be Command 2, Option 3, etc. The scraped data will be concatenated together into the system variable [TEMP5], with the individual elements separated by "<|>". If you have 10 or fewer elements in a single regex expression, the individual elements will also be returned in the [LOCAL1] through [LOCAL10] variables.
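
For example (the values here are made up), if the regex in [URL_1_1] has three capturing groups that match "72", "NW", and "8 mph", the trigger for Command 1, Option 1 fires and you'd see something like:

Code:
[TEMP5]  = 72<|>NW<|>8 mph
[LOCAL1] = 72
[LOCAL2] = NW
[LOCAL3] = 8 mph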

I hope this gives you enough info to get started.

Dave.


GadgetGuy (Super User - Joined: June 01 2008 - United States - Posts: 942)
Posted: November 08 2014 at 08:38

Dave-

I am about to attempt to use this URLScraper, but I find that more info on its "care & feeding" is needed.

Specifically, I see in the .ini file the following parameters that I'm not sure how to configure.

Can you clarify ...
freq=
scrapecount=
regexoccur=
regexflags=

Thanks.


__________________
Ken B - Live every day like it's your last. Eventually, you'll get it right!
dhoward (Admin Group - Joined: June 29 2001 - United States - Posts: 4447)
Posted: November 08 2014 at 12:33

Ken,

Probably easiest to explain in terms of the actual sample posted below:
Code:
[config]
urlcount=1

[URL_1]
url=http://www.wund.com/cgi-bin/findweather/getForecast?query=Eindhoven
freq=0.5
scrapecount=2

[URL_1_1]
regexsearch=<div id="main">[\s\S]*?<span>(.+)</span>[\s\S]*?<h4>(.+)</h4>[\s\S]*?<label>Wind:</label>[\s\S]*?<span>[\s\S]*?<span>(.+)</span>[\s\S]*?from[\s\S]*?<span>(.+)</span>[\s\S]*?<label>Dew Point:</label>[\s\S]*?<span>(.+)</span>
regexoccur=1
regexflags=0

[URL_1_2]
regexsearch=<label>Pressure:</label>[\s\S]*?<b>(.+)</b>[\s\S]*?<label>Windchill:</label>[\s\S]*?<span>(.+)</span>[\s\S]*?<label>Humidity:</label>[\s\S]*?<div class="b">(.+)</div>[\s\S]*?<label>Visibility:</label>[\s\S]*?<span>(.+)</span>
regexoccur=1
regexflags=0


You'll start with the urlcount under the [config] section. This determines how many unique URLs will be retrieved (a single instance of the plugin can retrieve multiple different URLs). For each URL in the count, you'll have a corresponding URL section: with a urlcount of 1 you'll have a [URL_1] section, and with a count of 2 you'll have both [URL_1] and [URL_2] sections.

Within a [URL_X] section, you'll have the url, the freq (the frequency in minutes for how often to retrieve the URL, so the freq=0.5 in the sample retrieves the page every 30 seconds), and the scrapecount. The scrapecount is how many regex searches will be made against the retrieved URL HTML data. For [URL_1] with a scrapecount of 2, you'll have both a [URL_1_1] and a [URL_1_2] section. If you have a [URL_2] section with a scrapecount of 1, then you'll have a [URL_2_1] section. The skeleton below shows how this fits together.
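
If you wanted to extend the sample to a second URL, it would look something like this (the second url and its regex are just placeholders to show the structure):

Code:
[config]
urlcount=2

[URL_1]
url=http://www.wund.com/cgi-bin/findweather/getForecast?query=Eindhoven
freq=0.5
scrapecount=2

[URL_1_1]
regexsearch=...
regexoccur=1
regexflags=0

[URL_1_2]
regexsearch=...
regexoccur=1
regexflags=0

[URL_2]
url=http://www.example.com/status.html
freq=5
scrapecount=1

[URL_2_1]
regexsearch=<span id="reading">(.+?)</span>
regexoccur=1
regexflags=0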

Each [URL_X_Y] section defines a regex search for the URL and fires a generic plugin trigger. The "X" value corresponds to the Trigger ID column (Command 1 for an X value of 1) and the "Y" value corresponds to the Trigger Value column (Option 1 for a Y value of 1). The regex search uses the VBScript regular expression engine (the same engine used by the new ph_regex???2 functions), and full details on the syntax can be found here: http://www.regular-expressions.info/vb.html. The regexoccur and regexflags parameters correspond to the occur and flags parameters, respectively, as documented in the PH help file for the ph_regex2 function.
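
As a quick example of regexoccur (again, the HTML is hypothetical): if a page lists several readings with identical markup and you want the second one, something like this should grab it, assuming occur works the same way it does for ph_regex2 (i.e., it selects which match is returned):

Code:
[URL_1_1]
regexsearch=<td class="reading">(.+?)</td>
regexoccur=2
regexflags=0

I'm assuming regexflags=0 gives the default matching behavior here; the ph_regex2 help topic documents the available flag values.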

Hope this helps,

Dave.