LinkExtractor {Rcrawler}    R Documentation
LinkExtractor
Description
Fetch and parse a document by URL to extract page info, HTML source and links (internal/external). The fetching process can be done by an HTTP GET request or through a webdriver (phantomjs), which simulates real browser rendering.
Usage
LinkExtractor(url, id, lev, IndexErrPages, Useragent, Timeout = 6,
use_proxy = NULL, URLlenlimit = 255, urlExtfilter, urlregexfilter,
encod, urlbotfiler, removeparams, removeAllparams = FALSE,
ExternalLInks = FALSE, urlsZoneXpath = NULL, Browser,
RenderingDelay = 0)
Arguments
url: character, the URL to fetch and parse.
id: numeric, an id to identify a specific web page in a website collection; auto-generated by the Rcrawler function.
lev: numeric, the depth level of the web page; auto-generated by the Rcrawler function.
IndexErrPages: character vector, the HTTP status codes that can be processed. By default only successful (200) responses are parsed; add other codes, e.g. c(200, 404), to also parse error pages.
Useragent: character, the name of the request sender; defaults to "Rcrawler", but we recommend using a regular browser user-agent to avoid being blocked by some servers.
Timeout: numeric, the request timeout in seconds; defaults to 6.
use_proxy: an object created by the httr::use_proxy() function, if you want to use a proxy to retrieve the web page (does not work with the webdriver).
URLlenlimit: integer, the maximum URL length to process; defaults to 255 characters (useful to avoid spider traps).
urlExtfilter: character vector, the list of file extensions to exclude from parsing. Currently only HTML pages are processed (parsed, scraped); to use your own list, supply a character vector of extensions.
urlregexfilter: character vector, filter extracted internal URLs by one or more regular expressions.
encod: character, the web page character encoding.
urlbotfiler: character vector, directories/files restricted by robots.txt.
removeparams: character vector, list of URL parameters to be removed from web page internal links.
removeAllparams: boolean, if TRUE the list of scraped URLs will have no parameters.
ExternalLInks: boolean, defaults to FALSE; if set to TRUE external links are also returned.
urlsZoneXpath: XPath pattern of the section from which links should be exclusively gathered/collected.
Browser: the client object of a remote headless web driver (virtual browser), created by run_browser().
RenderingDelay: the time required for a web page to be fully rendered, in seconds; defaults to 0.
Value
Returns a list of three elements: the first is a list containing the web page details (url, encoding-type, content-type, content, etc.), the second is a character vector containing the list of retrieved internal URLs, and the third is a vector of external URLs.
Author(s)
salim khalil
Examples
## Not run:
###### Fetch a URL using GET request :
######################################################
##
## Very fast, but can't fetch JavaScript-rendered pages or sections
# Fetch the page with the default config; returns page info and internal links
page<-LinkExtractor(url="http://www.glofile.com")
# This will also return external links
page<-LinkExtractor(url="http://www.glofile.com", ExternalLInks = TRUE)
# Specify a Useragent to avoid being blocked by some websites' bot rules
page<-LinkExtractor(url="http://www.glofile.com", ExternalLInks = TRUE,
Useragent = "Mozilla/5.0 (Windows NT 6.3; Win64; x64)")
# By default, only pages with a successful HTTP status are parsed. To force
# parsing of error pages like 404, specify IndexErrPages:
page<-LinkExtractor(url="http://www.glofile.com/404notfoundpage",
ExternalLInks = TRUE, IndexErrPages = c(200,404))
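#### Filter and clean extracted links
#
# A minimal sketch of the link-filtering arguments; the regex, the query
# parameter names and the XPath below are illustrative values only.
# Keep only internal urls matching a regular expression
page<-LinkExtractor(url="http://www.glofile.com",
                    urlregexfilter = "/2017/")
# Remove specific query parameters (or all of them) from extracted urls
page<-LinkExtractor(url="http://www.glofile.com",
                    removeparams = c("utm_source", "utm_medium"))
page<-LinkExtractor(url="http://www.glofile.com", removeAllparams = TRUE)
# Collect links only from a specific page zone given by an XPath pattern
page<-LinkExtractor(url="http://www.glofile.com",
                    urlsZoneXpath = "//div[@id='content']")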
#### Use GET request with a proxy
#
proxy<-httr::use_proxy("190.90.100.205",41000)
pageinfo<-LinkExtractor(url="http://glofile.com/index.php/2017/06/08/taux-nette-detente/",
use_proxy = proxy)
# Note: the use_proxy argument cannot be combined with the webdriver (Browser) option
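# The GET request can also be tuned; a minimal sketch using Timeout and
# URLlenlimit (the values below are illustrative, not package defaults)
page<-LinkExtractor(url="http://www.glofile.com", Timeout = 10,
                    URLlenlimit = 500)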
###### Fetch a URL using a web driver (virtual browser)
######################################################
##
## Slow, because a headless browser called phantomjs will simulate
## a user session on a website. It's useful for web pages having important
## JavaScript-rendered sections such as menus.
## We recommend that you first try the normal request above; if the function
## returns a forbidden 403 status code or an empty/incomplete source code body,
## then try to set a normal useragent like
## Useragent = "Mozilla/5.0 (Windows NT 6.3; Win64; x64)",
## and if you still have issues you should try to set up a virtual browser.
#1 Download and install phantomjs headless browser
install_browser()
#2 Start the browser process (usually takes about 30 seconds)
br <-run_browser()
#3 Call the function
page<-LinkExtractor(url="http://www.master-maroc.com", Browser = br,
ExternalLInks = TRUE)
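# If the page loads content asynchronously, a rendering delay can help;
# a minimal sketch, the 2-second value below is illustrative only
page<-LinkExtractor(url="http://www.master-maroc.com", Browser = br,
                    RenderingDelay = 2)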
#4 Don't forget to stop the browser at the end of all your work with it
stop_browser(br)
###### Fetch a web page that requires authentication
#########################################################
## In some cases you may need to retrieve content from a web page which
## requires authentication via a login page, like private forums or platforms.
## In this case you need to run the \link{LoginSession} function to establish an
## authenticated browser session, then use \link{LinkExtractor} to fetch
## the URL using the authenticated session.
## In the example below we will try to fetch a private blog post which
## requires authentication.
# If you retrieve the page using the regular LinkExtractor function or your browser,
page<-LinkExtractor("http://glofile.com/index.php/2017/06/08/jcdecaux/")
# the post is not visible because it's private.
# Now we will try to log in to access this post using the following credentials:
# username: demo and password: rc@pass@r
#1 Download and install phantomjs headless browser (skip if installed)
install_browser()
#2 Start the browser process
br <-run_browser()
#3 Create an authenticated session
# see \link{LoginSession} for more details
LS<-LoginSession(Browser = br, LoginURL = 'http://glofile.com/wp-login.php',
LoginCredentials = c('demo','rc@pass@r'),
cssLoginFields =c('#user_login', '#user_pass'),
cssLoginButton='#wp-submit' )
# Check if the login was successful
LS$session$getTitle()
#Or
LS$session$getUrl()
#Or
LS$session$takeScreenshot(file = 'sc.png')
#4 Retrieve the target private page using the logged-in session
page<-LinkExtractor(url='http://glofile.com/index.php/2017/06/08/jcdecaux/', Browser = LS)
#5 Don't forget to stop the browser at the end of all your work with it
stop_browser(LS)
################### Returned Values #####################
#########################################################
# The returned 'page' variable should include:
# 1- A list of page details,
# 2- Internal links,
# 3- External links.
#1 Vector of extracted internal links (in-links)
page$InternalLinks
#2 Vector of extracted external links (out-links)
page$ExternalLinks
page$Info
# Requested Url
page$Info$Url
# Total number of extracted links
page$Info$SumLinks
# The status code of the HTTP response 200, 401, 300...
page$Info$Status_code
# The MIME type of this content from HTTP response
page$Info$Content_type
# Page text encoding UTF8, ISO-8859-1 , ..
page$Info$Encoding
# Page source code
page$Info$Source_page
# Page title
page$Info$Title
# Other returned values (page$Info$Id, page$Info$Crawl_level,
# page$Info$Crawl_status) are only used by the Rcrawler function.
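# A minimal sketch (assuming the calls above succeeded): follow each
# extracted internal link and print its HTTP status code
for(u in page$InternalLinks){
  p <- LinkExtractor(url = u)
  cat(u, "->", p$Info$Status_code, "\n")
}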
## End(Not run)