
Scraping the fully rendered page, not the HTML?
As you know, one can use regex to pull specific data out of a page.
However, some websites use JavaScript to hide their data.
So the page you see in your browser is different from the raw HTML source.
What is a possible solution? Is there any way to turn the fully rendered page into HTML and then scrape that?
Another difficulty is scraping Flash. Is it even possible to scrape text in Flash? I do not see how it is possible, unless the .swf file is downloaded, decompiled, and then matched against a regex.....
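To make the mismatch between the raw source and the rendered page concrete, here is a minimal Python sketch. Both HTML strings, the price pattern, and the `find_price` helper are made up for illustration; the "rendered" string simply stands in for whatever a real browser would produce after running the script.

```python
import re

# Hypothetical raw HTML as a server might send it: the price is not in the
# markup itself, it is written into the page by a script at render time.
RAW_HTML = """
<html><body>
<div id="price"></div>
<script>
  document.getElementById("price").innerHTML = "Price: " + (40 + 2) + " USD";
</script>
</body></html>
"""

# Hypothetical DOM after a browser has executed the script above.
RENDERED_HTML = """
<html><body>
<div id="price">Price: 42 USD</div>
</body></html>
"""

PRICE_RE = re.compile(r"Price: (\d+) USD")


def find_price(html):
    """Return the price as an int, or None if the pattern is absent."""
    match = PRICE_RE.search(html)
    return int(match.group(1)) if match else None


print(find_price(RAW_HTML))       # None: the value only exists after the script runs
print(find_price(RENDERED_HTML))  # 42
```

The same regex works on the second string only because something has already executed the JavaScript, which is the step the rest of this thread is about.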
cselzer 04-17-2009, 11:18 PM There's a way to do anything; this one just isn't easy. You would need to write a special JavaScript parser that runs the script and grabs the data as it is generated, if that makes sense...
Basically, you would execute the JavaScript, but instead of displaying the result you would direct it to an output log or a variable, depending on how you want it stored, and then display it however you want.
It's not an easy project. It would most likely need to be done in C, or another language that can use a JavaScript library to parse it.
azizny 04-18-2009, 01:19 AM For JavaScript, you simply download the code along with the HTML and decode it (if it is encoded) or simply translate it to regular HTML. For any JavaScript, there exists an HTML translation.
For Flash, you will have to download it and either decompile it or convert it to HTML.
Peace,
cygnusd 04-18-2009, 01:23 AM Both are very much tied to the browser rendering the content. I couldn't see myself implementing my own server-side JavaScript interpreter or something.
For JavaScript, I think it could be easier. You may need to build a browser plugin or interface with something like Selenium IDE (http://seleniumhq.org/projects/ide/). Selenium can automate tasks and I think it's flexible enough to be used for scraping.
For Flash, I think it's hopeless. A brute-force idea would be to grab a screenshot, crop it to the browser window area (the challenge is scrollable content), and then run the images through OCR to get the text. Good luck automating this, especially when the Flash content is interactive, which is usually the case.
:D
LOL @ OCR. I actually tried to do that... but realized how poor OCR is right now... well, at least from what I could find on open source sites...
btw the Selenium site is 404
cygnusd 04-18-2009, 01:53 AM remove the extra parenthesis.
http://seleniumhq.org/projects/ide/
mwatkins 04-18-2009, 03:52 AM Worth a look at:
http://simile.mit.edu/wiki/Crowbar
I think Crowbar is basically what I want.
However, it requires FF....
It would be nice if there was a complete command line solution.
for instance
$home> parse http://www.somesite.com > index.html
$home> perl scraper.pl index.html
==== Showing Scraped Data ====
Somesite.com
bunch of outputs here.
$home>
where parse.exe would basically render that page as a web browser would, then write the result to stdout and into an HTML file, which is picked up by a Perl scraper that extracts and echoes out the desired data.
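The second stage of that pipeline (the scraper that reads the already rendered index.html) could look roughly like the sketch below. It is written in Python rather than Perl, uses only the standard library, and the choice of what to extract (the page title and link targets) is an assumption made for illustration. The rendering stage is not covered here; it still needs a real browser.

```python
import sys
from html.parser import HTMLParser


class TitleAndLinks(HTMLParser):
    """Collect the <title> text and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scrape(html):
    """Return (title, links) extracted from an HTML string."""
    parser = TitleAndLinks()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python scraper.py index.html
    with open(sys.argv[1], encoding="utf-8") as fh:
        title, links = scrape(fh.read())
    print("==== Showing Scraped Data ====")
    print(title)
    for link in links:
        print(link)
```

Keeping the scraper separate from the renderer means it does not care which browser produced the file, so the rendering stage can be swapped out later.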
perhaps links2 would work for this task?
larwilliams 04-18-2009, 07:13 PM perhaps links2 would work for this task?
This is why I consider JavaScript to be a PITA. There is no reliable way to parse it and capture the completely rendered page as the browser sees it, without a miracle :D
foobic 04-18-2009, 10:07 PM I think Crowbar is basically what I want.
However, it requires FF....
It would be nice if there was a complete command line solution.
What you want (rendering a page including js) requires a browser. You can either use an existing one or create a new one - which do you think would be quicker / easier / more reliable? ;)
FF is the obvious choice because:
it's open source.
it's popular, so most web sites can be expected to work on it.
mwatkins 04-18-2009, 10:29 PM And FF / Mozilla is available on Unix/Linux.
It won't be terribly fast, but it'll be workable. If you get really tricky you could set up a pool of Mozilla instances you can fire page render requests off to.
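The pool idea can be sketched in Python as follows. This is only an outline under stated assumptions: `render_all`, the endpoint names, and `fake_render` are hypothetical, and the `render` callable is a placeholder for whatever actually asks one browser instance (for example a Firefox running Crowbar) to load a page and hand back the HTML.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle


def render_all(urls, endpoints, render):
    """Render every URL, spreading the work round-robin over the endpoints.

    `render(endpoint, url)` is supplied by the caller; it should ask one
    browser instance to load `url` and return the resulting HTML.
    """
    if not endpoints:
        raise ValueError("at least one rendering endpoint is required")
    # Pair each URL with an endpoint: e1, e2, ..., then back to e1.
    jobs = list(zip(cycle(endpoints), urls))
    # One worker thread per browser instance, so each handles one page at a time
    # on average; the mapping of job to thread is left to the executor.
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        pages = pool.map(lambda job: render(*job), jobs)
        return dict(zip(urls, pages))


def fake_render(endpoint, url):
    """Stand-in renderer used for the demo; no browser is involved."""
    return f"<html><!-- {url} rendered by {endpoint} --></html>"


if __name__ == "__main__":
    demo = render_all(
        ["http://www.somesite.com/a", "http://www.somesite.com/b"],
        ["firefox-1", "firefox-2"],
        fake_render,
    )
    for url, html in demo.items():
        print(url, "->", html)
```

Threads are enough here because each worker spends its time waiting on a browser, not computing; the browsers themselves are the slow part.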
Hopefully all this effort isn't going into producing a better mouse trap... for collecting email addresses! ;)
You are right, maybe I was being too greedy. I guess setting up a dozen VPSes and running FF with Crowbar should do it.
lol. The email list market is so saturated... harvested emails are not worth the time.
tim2718281 05-15-2009, 03:09 PM What is a possible solution? Is there any way to turn the fully rendered page into HTML and then scrape that?
Sure, you can do it on the client side. Firebug can display the HTML for the page Firefox is displaying.
mwatkins 05-16-2009, 09:45 PM jjk2 - this is slightly off topic but given all the talk about screen scraping lately, I could not help but think of you and your business.
What's off topic is this example of "targeted scraping" that I cooked up. Have a look at the example URL in the second example - check out the HTML source at that URL and then compare that to the final output of the queried source at the end of the post.
http://forums.freebsd.org/showthread.php?p=24263#post24263
Maybe techniques like jquery and its Python cousins, or working directly with lxml - very fast and fairly forgiving - might come in handy for you too one day.
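As a rough illustration of that kind of targeted, query-based extraction, the Python sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, so that it runs without extra packages. Unlike lxml's HTML parser it only accepts well-formed markup, so it is not forgiving at all. The table contents and the `extract_rows` helper are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment. Real-world HTML is often not
# well-formed, which is where a forgiving parser such as lxml helps.
PAGE = """
<html><body>
  <table id="items">
    <tr class="row"><td class="name">alpha</td><td class="value">1</td></tr>
    <tr class="header"><td>ignored</td><td>ignored</td></tr>
    <tr class="row"><td class="name">beta</td><td class="value">2</td></tr>
  </table>
</body></html>
"""


def extract_rows(markup):
    """Return (name, value) pairs from every <tr class="row"> in the markup."""
    root = ET.fromstring(markup.strip())
    rows = []
    # ElementTree supports a small XPath subset, including attribute predicates.
    for tr in root.findall(".//tr[@class='row']"):
        name = tr.findtext("td[@class='name']")
        value = tr.findtext("td[@class='value']")
        rows.append((name, value))
    return rows


print(extract_rows(PAGE))  # [('alpha', '1'), ('beta', '2')]
```

Selecting by element and attribute like this tends to survive small layout changes better than a regex over the whole page, though it still only sees whatever markup it is given.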
thank you. your help is greatly appreciated!
HivelocityDD 05-17-2009, 05:39 AM With Firefox there are many add-ons you can check out that will do the job on the client side. The generated HTML code can be viewed using certain Firefox plugins/add-ons.