LinkExtractor {Rcrawler}    R Documentation
LinkExtractor
Description
Fetch and parse a document by URL to extract page info, HTML source and links (internal/external). The fetching process can be done by an HTTP GET request or through a webdriver (phantomjs), which simulates real browser rendering.
Usage
LinkExtractor(url, id, lev, IndexErrPages, Useragent, Timeout = 6,
use_proxy = NULL, URLlenlimit = 255, urlExtfilter, urlregexfilter,
encod, urlbotfiler, removeparams, removeAllparams = FALSE,
ExternalLInks = FALSE, urlsZoneXpath = NULL, Browser,
RenderingDelay = 0)
Arguments
url: character, the URL to fetch and parse.
id: numeric, an id to identify a specific web page in a website collection; auto-generated by the Rcrawler function.
lev: numeric, the depth level of the web page; auto-generated by the Rcrawler function.
IndexErrPages: character vector, the HTTP status codes that can be processed. By default only successful (200) responses are parsed; add other codes, e.g. c(200, 404), to also parse error pages.
Useragent: character, the name of the request sender; defaults to "Rcrawler", but we recommend using a regular browser user-agent to avoid being blocked by some servers.
Timeout: numeric, the request timeout in seconds; defaults to 6.
use_proxy: an object created by the httr::use_proxy() function, if you want to use a proxy to retrieve the web page (does not work with the webdriver).
URLlenlimit: integer, the maximum URL length to process; defaults to 255 characters (useful to avoid spider traps).
urlExtfilter: character vector, the list of file extensions to exclude from parsing. Currently only HTML pages are processed (parsed, scraped); to use your own list, supply a character vector of extensions.
urlregexfilter: character vector, filter extracted internal URLs by one or more regular expressions.
encod: character, the web page character encoding.
urlbotfiler: character vector, directories/files restricted by robots.txt.
removeparams: character vector, list of URL parameters to be removed from web page internal links.
removeAllparams: boolean, if TRUE the list of scraped URLs will have no parameters.
ExternalLInks: boolean, defaults to FALSE; if set to TRUE external links are also returned.
urlsZoneXpath: XPath pattern of the section from which links should be exclusively gathered/collected.
Browser: the client object of a remote headless web driver (virtual browser), created by run_browser().
RenderingDelay: the time required for a web page to be fully rendered, in seconds; defaults to 0.
Value
Returns a list of three elements: the first is a list containing the web page details (url, encoding-type, content-type, content, etc.), the second is a character vector containing the list of retrieved internal URLs, and the third is a vector of external URLs.
Author(s)
salim khalil
Examples
## Not run:
###### Fetch a URL using GET request :
######################################################
##
## Very fast, but can't fetch JavaScript-rendered pages or sections
# Fetch the page with the default config; returns page info and internal links
page<-LinkExtractor(url="http://www.glofile.com")
# This will also return external links
page<-LinkExtractor(url="http://www.glofile.com", ExternalLInks = TRUE)
# Specify a Useragent to avoid being blocked by some websites' bot rules
page<-LinkExtractor(url="http://www.glofile.com", ExternalLInks = TRUE,
Useragent = "Mozilla/5.0 (Windows NT 6.3; Win64; x64)")
# By default, only pages with a successful HTTP status are parsed. To force
# parsing of error pages like 404, specify IndexErrPages:
page<-LinkExtractor(url="http://www.glofile.com/404notfoundpage",
ExternalLInks = TRUE, IndexErrPages = c(200,404))
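#### Filter and clean extracted links
#
# A minimal sketch of the link-filtering arguments; the regex, the query
# parameter names and the XPath below are illustrative values only.
# Keep only internal urls matching a regular expression
page<-LinkExtractor(url="http://www.glofile.com",
                    urlregexfilter = "/2017/")
# Remove specific query parameters (or all of them) from extracted urls
page<-LinkExtractor(url="http://www.glofile.com",
                    removeparams = c("utm_source", "utm_medium"))
page<-LinkExtractor(url="http://www.glofile.com", removeAllparams = TRUE)
# Collect links only from a specific page zone given by an XPath pattern
page<-LinkExtractor(url="http://www.glofile.com",
                    urlsZoneXpath = "//div[@id='content']")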
#### Use GET request with a proxy
#
proxy<-httr::use_proxy("190.90.100.205",41000)
pageinfo<-LinkExtractor(url="http://glofile.com/index.php/2017/06/08/taux-nette-detente/",
use_proxy = proxy)
# Note: the use_proxy argument cannot be combined with the webdriver (Browser) option
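# The GET request can also be tuned; a minimal sketch using Timeout and
# URLlenlimit (the values below are illustrative, not package defaults)
page<-LinkExtractor(url="http://www.glofile.com", Timeout = 10,
                    URLlenlimit = 500)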
###### Fetch a URL using a web driver (virtual browser)
######################################################
##
## Slow, because a headless browser called phantomjs will simulate
## a user session on a website. It's useful for web pages having important
## JavaScript-rendered sections such as menus.
## We recommend that you first try the normal request above; if the function
## returns a forbidden 403 status code or an empty/incomplete source code body,
## then try to set a normal useragent like
## Useragent = "Mozilla/5.0 (Windows NT 6.3; Win64; x64)",
## and if you still have issues you should try to set up a virtual browser.
#1 Download and install phantomjs headless browser
install_browser()
#2 Start the browser process (usually takes about 30 seconds)
br <-run_browser()
#3 Call the function
page<-LinkExtractor(url="http://www.master-maroc.com", Browser = br,
ExternalLInks = TRUE)
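# If the page loads content asynchronously, a rendering delay can help;
# a minimal sketch, the 2-second value below is illustrative only
page<-LinkExtractor(url="http://www.master-maroc.com", Browser = br,
                    RenderingDelay = 2)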
#4 Don't forget to stop the browser at the end of all your work with it
stop_browser(br)
###### Fetch a web page that requires authentication
#########################################################
## In some cases you may need to retrieve content from a web page which
## requires authentication via a login page, like private forums or platforms.
## In this case you need to run the \link{LoginSession} function to establish an
## authenticated browser session, then use \link{LinkExtractor} to fetch
## the URL using the authenticated session.
## In the example below we will try to fetch a private blog post which
## requires authentication.
# If you retrieve the page using the regular LinkExtractor function or your browser,
page<-LinkExtractor("http://glofile.com/index.php/2017/06/08/jcdecaux/")
# the post is not visible because it's private.
# Now we will try to log in to access this post using the following credentials:
# username: demo and password: rc@pass@r
#1 Download and install phantomjs headless browser (skip if installed)
install_browser()
#2 Start the browser process
br <-run_browser()
#3 Create an authenticated session
# see \link{LoginSession} for more details
LS<-LoginSession(Browser = br, LoginURL = 'http://glofile.com/wp-login.php',
LoginCredentials = c('demo','rc@pass@r'),
cssLoginFields =c('#user_login', '#user_pass'),
cssLoginButton='#wp-submit' )
# Check if the login was successful
LS$session$getTitle()
#Or
LS$session$getUrl()
#Or
LS$session$takeScreenshot(file = 'sc.png')
#4 Retrieve the target private page using the logged-in session
page<-LinkExtractor(url='http://glofile.com/index.php/2017/06/08/jcdecaux/', Browser = LS)
#5 Don't forget to stop the browser at the end of all your work with it
stop_browser(LS)
################### Returned Values #####################
#########################################################
# The returned 'page' variable should include:
# 1- A list of page details,
# 2- Internal links,
# 3- External links.
#1 Vector of extracted internal links (in-links)
page$InternalLinks
#2 Vector of extracted external links (out-links)
page$ExternalLinks
page$Info
# Requested Url
page$Info$Url
# Total number of extracted links
page$Info$SumLinks
# The status code of the HTTP response 200, 401, 300...
page$Info$Status_code
# The MIME type of this content from HTTP response
page$Info$Content_type
# Page text encoding UTF8, ISO-8859-1 , ..
page$Info$Encoding
# Page source code
page$Info$Source_page
# Page title
page$Info$Title
# Other returned values (page$Info$Id, page$Info$Crawl_level,
# page$Info$Crawl_status) are only used by the Rcrawler function.
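# A minimal sketch (assuming the calls above succeeded): follow each
# extracted internal link and print its HTTP status code
for(u in page$InternalLinks){
  p <- LinkExtractor(url = u)
  cat(u, "->", p$Info$Status_code, "\n")
}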
## End(Not run)