PowerHome General
 PowerHome Messageboard : PowerHome General
Subject Topic: URL scraper Post ReplyPost New Topic
Author
Message << Prev Topic | Next Topic >>
raven77 (Groupie - Joined: January 02 2007 - United States - Posts: 44)
Posted: August 08 2011 at 09:58

Can someone give me a quick explanation of how to implement this? I would like to use this new feature to pull temperature data from the CAI Networks WebControl.
Thanks!
dhoward (Admin Group - Joined: June 29 2001 - United States - Posts: 4447)
Posted: August 09 2011 at 23:05

Raven,

I had totally forgotten that I had included the URL Scraper plugin with the beta and meant to document it sooner. In the meantime, here are some quick basics to get going:

In the PowerHome Explorer under PowerHome|Setup|Plugins, create a new plugin line. Give it a suitable ID. Under "Launch Data (ActiveX classname)", use:

PH_URLScraper.phurlscraper

For "Initialization Data", enter the path and filename for the urlscraper.ini file. This will typically be:

c:\powerhome\plugins\urlscraper.ini
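
So, for example, the complete plugin line might end up looking something like this (the "URLSCRAPER" ID is just an example, use whatever ID suits you):

Code:
ID:                               URLSCRAPER
Launch Data (ActiveX classname):  PH_URLScraper.phurlscraper
Initialization Data:              c:\powerhome\plugins\urlscraper.ini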

Edit the urlscraper.ini file for the URL and data you're trying to get. The default file is currently set to scrape weather data for ZIP code 32712. The regular expression syntax is the standard VBScript flavor, which is best documented here: http://www.regular-expressions.info/vbscript.html.
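
Just to illustrate the syntax (the HTML here is made up, not from the actual weather page), a search with a single capturing group might look like:

Code:
regexsearch=<span class="temp">([-0-9.]+)</span>

The part in parentheses is what gets captured and passed along; everything else just has to match the surrounding HTML.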

When the plugin scans the URL and finds a regex match, it will fire the appropriate trigger. The regex search for section [URL_1_1] will fire the generic plugin trigger for the plugin ID with Command 1, Option 1. [URL_1_2] will be Command 1, Option 2. [URL_2_3] will be Command 2, Option 3, etc. The scraped data will be concatenated together into the system variable [TEMP5], with the individual elements separated by "<|>". If you have 10 or fewer elements in a single regex expression, the individual elements will also be returned in the [LOCAL1] through [LOCAL10] variables.
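
For example (the values here are made up), if the regex in [URL_1_1] has three capturing groups that match "72", "NW", and "8 mph", the trigger for Command 1, Option 1 fires and you'd see something like:

Code:
[TEMP5]  = 72<|>NW<|>8 mph
[LOCAL1] = 72
[LOCAL2] = NW
[LOCAL3] = 8 mph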

I hope this gives you enough info to get started.

Dave.


GadgetGuy (Super User - Joined: June 01 2008 - United States - Posts: 942)
Posted: November 08 2014 at 08:38

Dave-

I am about to attempt to use this URLScraper, but I find that more info on its "care & feeding" is needed.

Specifically, I see in the .ini file the following parameters that I'm not sure how to configure.

Can you clarify ...
freq=
scrapecount=
regexoccur=
regexflags=

Thanks.


__________________
Ken B - Live every day like it's your last. Eventually, you'll get it right!
dhoward (Admin Group - Joined: June 29 2001 - United States - Posts: 4447)
Posted: November 08 2014 at 12:33

Ken,

Probably easiest to explain in terms of the actual sample posted below:
Code:
[config]
urlcount=1

[URL_1]
url=http://www.wund.com/cgi-bin/findweather/getForecast?query=Eindhoven
freq=0.5
scrapecount=2

[URL_1_1]
regexsearch=<div id="main">[\s\S]*?<span>(.+)</span>[\s\S]*?<h4>(.+)</h4>[\s\S]*?<label>Wind:</label>[\s\S]*?<span>[\s\S]*?<span>(.+)</span>[\s\S]*?from[\s\S]*?<span>(.+)</span>[\s\S]*?<label>Dew Point:</label>[\s\S]*?<span>(.+)</span>
regexoccur=1
regexflags=0

[URL_1_2]
regexsearch=<label>Pressure:</label>[\s\S]*?<b>(.+)</b>[\s\S]*?<label>Windchill:</label>[\s\S]*?<span>(.+)</span>[\s\S]*?<label>Humidity:</label>[\s\S]*?<div class="b">(.+)</div>[\s\S]*?<label>Visibility:</label>[\s\S]*?<span>(.+)</span>
regexoccur=1
regexflags=0


You'll start with the urlcount under the [config] section. This determines how many unique URLs will be retrieved (a single instance of the plugin can retrieve multiple different URLs). For each URL in the count, you'll have a corresponding URL section: with a urlcount of 1 you'll have a [URL_1] section, and with a count of 2 you'll have both [URL_1] and [URL_2] sections.

Within a [URL_X] section, you'll have the url, the freq (the frequency in minutes for how often to retrieve the URL, so the freq=0.5 in the sample retrieves the page every 30 seconds), and the scrapecount. The scrapecount is how many regex searches will be made against the retrieved URL HTML data. For [URL_1] with a scrapecount of 2, you'll have both a [URL_1_1] and a [URL_1_2] section. If you have a [URL_2] section with a scrapecount of 1, then you'll have a [URL_2_1] section. The skeleton below shows how this fits together.
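
If you wanted to extend the sample to a second URL, it would look something like this (the second url and its regex are just placeholders to show the structure):

Code:
[config]
urlcount=2

[URL_1]
url=http://www.wund.com/cgi-bin/findweather/getForecast?query=Eindhoven
freq=0.5
scrapecount=2

[URL_1_1]
regexsearch=...
regexoccur=1
regexflags=0

[URL_1_2]
regexsearch=...
regexoccur=1
regexflags=0

[URL_2]
url=http://www.example.com/status.html
freq=5
scrapecount=1

[URL_2_1]
regexsearch=<span id="reading">(.+?)</span>
regexoccur=1
regexflags=0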

Each [URL_X_Y] section defines a regex search for the URL and fires a generic plugin trigger. The "X" value corresponds to the Trigger ID column (Command 1 for an X value of 1) and the "Y" value corresponds to the Trigger Value column (Option 1 for a Y value of 1). The regex search uses the VBScript regular expression engine (the same engine used by the new ph_regex???2 functions), and full details on the syntax can be found here: http://www.regular-expressions.info/vb.html. The regexoccur and regexflags parameters correspond to the occur and flags parameters, respectively, as documented in the PH help file for the ph_regex2 function.
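
As a quick example of regexoccur (again, the HTML is hypothetical): if a page lists several readings with identical markup and you want the second one, something like this should grab it, assuming occur works the same way it does for ph_regex2 (i.e., it selects which match is returned):

Code:
[URL_1_1]
regexsearch=<td class="reading">(.+?)</td>
regexoccur=2
regexflags=0

I'm assuming regexflags=0 gives the default matching behavior here; the ph_regex2 help topic documents the available flag values.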

Hope this helps,

Dave.