Statistics ++


Defining Variable in R Package Environment — September 1, 2017

Defining Variable in R Package Environment

Sometimes it is necessary to define a customized function inside a specific package's environment (in R) in order to get the most out of that package. I'll quickly describe the intention here along with the immediate problem (and, of course, a workable solution).

Consider the plyr package in R. If you have used it, you might know there is a nice progress bar option available with all of its apply series of functions, which works in non-parallel mode.

Try this out

library(plyr)
ldply(seq(10), function(i) {Sys.sleep(0.1); data.frame(x = rnorm(1), y = rnorm(1))}, .progress = "text")

Now, if I want to add my own progress bar, things get a little complicated. There are options and customizations, but not exactly a fully customized progress bar. So here is how you can do it.


progress_print <- function(){
  L <- list(init = function(x) {M <<- x}, step = function() NULL, term = function() NULL)
  itr <- 0
  M <- 1
  L$step <- function(){
    itr <<- itr + 1
    cat(paste0("Job ", itr, " out of ", M, "\n"))
  }
  return(L)
}
# this will give an error (shown commented out so the script still runs)
# assign("progress_print", value = progress_print, envir = environment(llply))
# so we have to do it like this
environment(progress_print) <- environment(llply)
# here is one usage of the above
ldply(seq(10), function(i) data.frame(x = rnorm(1), y = rnorm(1)), .progress = "print")


https://gist.github.com/bedantaguru/1004cd27774436ca334ee0d44569d918#file-demo-r

Look at this line (copied here from the above gist):

environment(progress_print) <- environment(llply)

This is the trick here.
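For context, the direct assign() route fails because a package namespace is sealed once it is loaded; a quick check of this [nothing specific to plyr, the same holds for any loaded package] is:

environmentIsLocked(asNamespace("plyr"))
# [1] TRUE
# assign()-ing into this namespace therefore stops with
# "cannot add bindings to a locked environment", which is why the
# environment(progress_print) <- environment(llply) route is used instead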

 

R-Studio GitHub and Proxy — January 16, 2017

R-Studio GitHub and Proxy

Normally people don't face problems connecting to GitHub and committing code to their repositories. However, I work in an environment where proxy-based supervision is enabled across the entire domain, which makes things a little complex for a coder. Fortunately, some organizational proxies provide access to GitHub and its sub-domains. Given this, I started thinking of using GitHub as my version control system.

There are broadly two kinds of network restrictions that organizations or companies usually impose:

  1. Proxy: An internal proxy server is used to block unwanted websites. Many organizations, specifically those who can afford an external vendor, use an Automatic Proxy Configuration Script [PAC file].
  2. VPN: Usually all network traffic is tunneled through a VPN, which makes things a little easier, provided the target site is reachable through the VPN.

Many of us may use a proxy script (typically provided by the organization one belongs to). In this technical post I'll discuss ways to use this proxy with the different services that a coder may require.

Getting the Proxy

First I'll show how one can get the proxy configuration script [which usually returns a different proxy depending on the site one tries to access] and parse it to get a usable proxy. By a usable proxy I mean either an HTTP, HTTPS or SOCKS proxy.

Steps to get a usable proxy from the PAC file:

  1. In Windows, open “Internet Options” [from the Internet Explorer settings, under the Connections tab] [or search for proxy in Windows 8 or later].
  2. Get the script
    In my case it’s like http://xxx.xx.xx.xx:8443/proxypac.pac [IP hidden].
  3. Open this URL in a normal browser. This will download the PAC file.
  4. Open that PAC file in a normal text editor. [If you have Notepad++, set the language to JavaScript for better readability.] Then figure out which proxy will actually be used for a specific website. If you don't want to wade through the code [though it's fairly simple in my view] you can use a PAC parser [an online version or a coding library]. [Personally I prefer the online method [unless it's blocked or the service is unavailable]: just paste the code and the desired website and get the proxy.] A small R sketch for fetching and inspecting the PAC file is given after this list.
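If you prefer to stay inside R, here is a minimal sketch that downloads the PAC file and lists the proxy entries mentioned in it [the PAC URL below is just the placeholder from step 2; substitute your own]:

# Substitute the PAC URL obtained from "Internet Options"
pac_url <- "http://xxx.xx.xx.xx:8443/proxypac.pac"
pac <- readLines(pac_url, warn = FALSE)
# PAC rules return strings like "PROXY 172.xx.x.xx:3128", "SOCKS ..." or "DIRECT",
# so grepping for PROXY/SOCKS shows the candidate proxies used by the script
grep("PROXY|SOCKS", pac, value = TRUE)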

Using the Proxy

Now you have a usable proxy [in my case it's an HTTP proxy like 172.xx.x.xx:3128] for GitHub, obtained from the method above. Let's proceed to using it. In the attempts section below I describe the path by which I arrived at the final conclusion; it may be skipped and the final implementation alone can be studied.

Attempts to use GitHub with R-Studio

I started with the R-Studio official guide and a really helpful r-bloggers post on my R-Studio Server setup [the base OS was CentOS 6.5]. I had access to two other systems at the same time [through remote and physical access]: my local work PC, which is governed by the same proxy, and my home PC, which has no proxy restrictions. I tried the instructions given in those documents on my home PC and they worked like a charm. But on the server I hit the first blockage in the DNS settings. When you use SSH to commit to GitHub, github.com needs to resolve correctly on the local machine. [If this sounds unfamiliar, don't worry; it's a basic networking concept most of us never need to touch, as it's generally left to the network support team. But I prefer to resolve system-level requirements myself, so I'll give a basic idea of the concept.]

In a Linux server you can check the name-resolution configuration by issuing the following command in a terminal:

cat /etc/resolv.conf

Usually you’ll see something like this

# Generated by NetworkManager
search your_domain.com
nameserver 172.xxx.xxx.xxx

Now, the IP given after nameserver is where name resolution happens: the actual public IP of a website is retrieved from a directory-like service hosted at that IP. For example, when you issue the following command

ping github.com

it goes to the nameserver and retrieves the IP for github.com. Here is an example output in Windows [with no proxy limitations]:

[Screenshot: ping github.com output]

Now, in this organization the nameserver does not even have an entry for github.com. This means that when you run “ping github.com” on a proxy-restricted system, it is not able to locate the actual IP address corresponding to github.com.
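A quick way to check the same thing from R itself [just a diagnostic sketch; it assumes the curl package is installed] is:

library(curl)
# Returns the resolved IP address; with error = FALSE it returns NULL instead of failing
nslookup("github.com", error = FALSE)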

So I thought of using the proxy to resolve the names [the human-readable address of an IP; here it's github.com we are trying to resolve]. For Windows I found a piece of software named Proxifier which does the job for you, but I did not find anything ready-made for Linux.

Then I researched alternatives to DNS over proxy for a bit. After a short search I thought, why don't I host a nameserver locally? That gave me some ideas which may be useful for someone; out of all the links on the topic, one can consult the posts on howtoforge.com and digitalocean.com. I started reading these documents but felt they were a little too much for such a simple requirement.

Then I looked for other ways to get things going, and got an idea from a Stack Overflow question where someone suggested using HTTP commits instead of SSH commits. This showed a way out: HTTP requests pass through curl-like tools, which in turn use the proxy server, so there is a good chance the name-resolution problem will not occur.

This indeed worked on my work PC, which runs Windows. But unfortunately it did not work on my Linux server.

I started looking for the root cause of the problem. A helpful topic on GitHub gave me the idea that I needed to upgrade git itself in order to make commits work smoothly over HTTP. The GitHub link led me to the Git download page, where I found information about the IUS Community Project. But unfortunately, after installing the IUS repository, when I issued a command like

yum install git

it did not update git to the desired version. So I Googled a bit for another solution, which led me to a nice blog on the same topic; but unfortunately repoforge.org was down. So I was left with only one option: compiling git from source.

Finally I was able to update the version of git by compiling it from source. I followed a really detailed blog on this [though I had to remove the old version of git first], and after that I was able to commit to GitHub from R-Studio. The gist and step-by-step instructions are given in the next section. I guess this method will work in all cases, even when the proxy is set up such that DNS can resolve the name; it's always better to have a generic method that works in most scenarios.

Use GitHub with R-Studio over Proxy

Here are the steps to be followed in a Linux environment [with a few minor alterations the same can be adopted on Windows; upgrading git on Windows is not a major problem and can be done easily].

  1. Step 1: First, from the console, remove git if its version is below 1.7.10.
    To check git version run

    git --version

    To remove it, use something like the following [I'm on CentOS; if you are on a different OS check the uninstall command]

    yum remove git
  2. Step 2: Follow this post. Only, while checking the version, use the command above instead of “git -version” [two “-” before version are required].
  3. Step 3: Set the global parameters as follows [run these commands in the console].
    Set your username and e-mail id for GitHub commits

    git config --global user.email "nil.gayen@gmail.com"
    git config --global user.name "bedantaguru"

    Set the proxy [as detected in the section above]

    git config --global http.proxy http://172.xx.x.xx:3128
    git config --global core.gitproxy "git-proxy"

    Now we are all set up to follow tutorials like the r-bloggers.com blog. A quick sanity check of this configuration from R is sketched after this list.

  4. Step 4: Now one can follow this blog [it is indeed a nice blog with a well-documented process flow]. Only, where the author suggests using “git config remote.origin.url”, that can be ignored; as far as I have seen, the commit works without it with the latest git compiled in Step 2. Remember, you have to commit at least once locally. Here is an example of what I did:
    [Screenshot]
  5. Step 5: When you use the HTTP commit method you'll get prompted for username and password even if you have set them at the global level. Provide these details for authentication.
    [Screenshot]

    After this you'll be able to complete the commit. Here is a screenshot of the same.

    In the console: [Screenshot]
    In the browser: [Screenshot]
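As the quick sanity check mentioned in Step 3, the configuration and connectivity can be verified from the R console [a minimal sketch; the repository URL is a placeholder, use any repository you can read]:

# Should list the http.proxy, user.name and user.email values set above
system2("git", c("config", "--global", "--list"))
# Tries to reach GitHub over HTTPS through the proxy
system2("git", c("ls-remote", "https://github.com/<user>/<repo>.git"))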

This concludes the steps involved in using a GitHub web account from R-Studio. It's required when you want to host your code on GitHub; if you are only interested in local version control, you probably don't need any of this.

Download Links:

  1. Git for Windows
  2. GitHub Desktop
  3. Git Source

Reference:

  1. GitHub Official Support Page : Using SSH over the HTTPS port
  2. R-Studio Official Blog : Version Control with Git and SVN
  3. Really Great r-bloggers.com Blog.
  4. Linux: Setup as DNS Client / Name Server IP Address
  5. Stack Overflow Questions : Q1 Q2 Q3

A Quick Guide to Practice R, Python and other Terminal Based Programming —

A Quick Guide to Practice R, Python and other Terminal Based Programming

When it comes to R, people sometimes get confused about which tutorial to start with, as there are millions of options. One can Google “R tutorial” and get lots of websites to learn from. The same goes for Python [in fact it's true for any well-known open-source tool]. Here I'm going to write down a few suitable setups based on the environment one is learning from.

Learning Environment with IDE

Well, obviously, to learn R or Python you'll need an IDE installed on the system you are learning from. For R my suggestion, and I think everybody's suggestion, is R-Studio. Also get familiar with Rseek.org [it provides a customized search for R-related functions]. For Python I generally suggest Anaconda [version > 3 if you are starting with Python], which is one of the most popular distributions of Python and the related tools a data scientist requires.

Learning Environment without IDE

The above holds when one is learning at home or at work [with admin rights, or with active assistance from a systems team who can install such IDEs]. What if you are operating from a friend's or a fellow student's laptop, where you don't have full control and neither IDE is installed? In such cases, freely available online terminal emulators come in very handy. There are also a couple of tutorial websites which provide a free course together with a web console for the subject under study, and a couple of websites which may not provide a tutorial [or where the tutorial is optional] but do provide a terminal for practicing. Here I'll list some such websites which I came across [all of them had free options, at least at the time of writing this blog].

List of websites providing R-Console for practice

  1. DataCamp: Provides free courses [optionally paid if one opts for certification or selected courses] for both R and Python, where users get a fully functioning R or Python terminal to practice in.
  2. TryR: Similar to DataCamp, but the terminal is a little slow compared to DataCamp. On the other hand, it can be used even without logging in.
  3. Coding Ground: This is different from the above two. It provides, freely and without any log-in, not only R but almost all of the major terminal-based tools. A tutorial is attached but entirely optional.
  4. ideone.com: This is more like Coding Ground but there is no tutorial attached [both Python and R are available].

There are other websites for Python alone. Here are a few:

  1. PythonAnyWhere : Fully functional IPython terminal.
  2. Repl.it: It provides a compiler or interpreter for many languages, including Python and Python 3 [though I have not found R in it; mostly these are computer-science languages like C, Java, etc.].
  3. College of the Holy Cross hosted Python interpreter.

Apart from these, if one has to practice Linux commands, one may try Webminal. [There are other choices as well, but Webminal is widely used.]

Tutorials Required in Either Environments

In either case, when you have access to an R or Python terminal or IDE [local or web based], you'll need practice and tutorial materials. A couple of them are already mentioned in the sections above; still, I'll list a few here for R and Python. [One note about the Python listing: there may be better web resources to start with depending on your intended use of Python, since Python is not only for data scientists but also for core developers. The listings here are more suitable for a data analyst.]

  1. R: TutorialsPoint, Programiz, Try R, Quick-R, DataCamp, R-Studio Learning
  2. Python: Programiz, DataCamp

If someone wants to do a proper course with certification on a paid basis, they can opt for Coursera.

Hope this helps. Happy learning.

Few Amazing R Functions — January 5, 2017

Few Amazing R Functions

I was trying to build two R packages: one for web crawling and another for scheduling R code in a cluster environment. During the planning phase I was searching for some really cool features of R, and this research turned up numerous amazing functions. I would like to share a few of them here; they can have entirely diverse applications.

Let me first show the code for these functions:


library(magrittr) # for the %>% pipe used below

get_current_function_call_hierarchy <- function(only_function_names = TRUE){
  call_list <- sys.status()$sys.calls
  # drop the last two frames (the sys.status() call and this function's own call)
  call_list <- call_list[-length(call_list)]
  call_list <- call_list[-length(call_list)]
  if(length(call_list) == 0){
    return(character(0))
  }
  if(!only_function_names){
    return(as.character(call_list))
  }
  return(call_list %>% lapply(as.character) %>% lapply("[[", 1) %>% unlist())
}


f <- function(test = TRUE){
  h <- function(){
    get_current_function_call_hierarchy()
  }
  j <- function(){
    h()
  }
  k <- function(){
    j()
  }
  if(test){
    h()
  }else{
    k()
  }
}


# demo usage: slow_me() and slow_inside() (defined in the next block) must be
# sourced before running this loop
f <- function(x){
  x + 1 + slow_inside()
}
x <- f
for(i in 1:5){
  print(x(1))
  x <- slow_me(x)
}


slow_me <- function(f){
  # make a clone of the function, otherwise the same object would be referred to
  # in the following line
  slow_function <- f
  h <- function(...){
    slow_function(...)
  }
  return(h)
}

slow_inside <- function(){
  f_c_now <- get_current_function_call_hierarchy()
  num <- table(f_c_now)["slow_function"] %>% as.numeric()
  if(!is.na(num)){
    return(num)
  }else{
    return(0)
  }
}


Now if you run f() and f(test = FALSE) from the following file

sample_use(get_current_function_call_hierarchy).R

you will get the following result:

> f()
[1] "f" "h"
> f(test = FALSE)
[1] "f" "k" "j" "h"

I think you get the idea. I was planning to use it for logging purposes: suppose you have a complex function sequence and you want to trace where a function broke in case of failure; you can easily make use of this function, as sketched below.
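As an illustration only [the helper name and the deliberately failing step below are hypothetical, not from the original gist], the hierarchy can be attached to error messages so that a log tells you where in the call chain things broke:

# Hypothetical helper: run one risky step and report the call hierarchy on failure
risky_step <- function(x){
  where <- paste(get_current_function_call_hierarchy(), collapse = " -> ")
  tryCatch(
    log(x), # stands in for any step that may fail
    error = function(e) stop("failed at [", where, "]: ", conditionMessage(e), call. = FALSE)
  )
}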

For the second set of functions the name may be confusing [the slow_me function]. But once one sees what it's meant for, I think it will become clear. Suppose I have a function which I would like to write in my own way, but with one special feature: increasing, say, the execution delay of certain code [this sometimes comes in handy in web crawling; when a page fails to load once, you may want a longer wait on the retry]. This feature is implemented by a slow-down function, which takes a function as input [where a slow_inside module is present in the function body] and returns a function which, on execution, understands how much to delay. If you wrap the function repeatedly, it will wait as many times as slow_me has been applied. In this demonstration, instead of actually waiting, I return the count just for representation; a waiting variant is sketched after the output below.

So here is the output

> x <- f
> for(i in 1:5){
+ print(x(1))
+ x<-slow_me(x)
+ }
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
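If one wants the wrapper to actually pause, a minimal sketch of the waiting variant could look like this [the helper name and the one-second base delay are my own choices, not from the original gist]:

# Hypothetical variant of slow_inside() that actually pauses instead of returning the count
slow_inside_wait <- function(base_delay = 1){
  n <- slow_inside() # how many slow_me() wrappers are currently on the call stack
  if(n > 0) Sys.sleep(n * base_delay)
  invisible(n)
}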

Hope you get the idea.

Make RSelenium work with R — October 24, 2016

Make RSelenium work with R

When it comes to web scraping, rvest (check out this official tutorial) is perhaps the best option available for scraping semi-static webpages (the reason I call them “semi-static” is that you can interact a little bit with dynamic webpages using rvest; for a quick reference check this out).

However, if you have to crawl a page which dynamically changes its content based on user input and interaction, then you'll probably end up using RSelenium. Now, if you look at the GitHub page, it clearly states that it's meant for Selenium 2. Since then Selenium 3 has been released and Firefox has been updated with Marionette (which I know rather little about). As a result my older code stopped working, and I immediately started researching how to make RSelenium work again on the updated dependencies. Here is what you need to do to get things done.

Step 1: Get the Selenium standalone server from the official Selenium website. Here is the direct link for Selenium 3.

Step 2: Place it somewhere the PATH will detect, or in a known directory. [I usually keep it in C:\Dev.]

Step 3: Download GeckoDriver and keep it in a folder (it can be the same folder as created under Step 2). The appropriate driver version has to be installed based on your OS.

Step 4: Install RSelenium in R by issuing

install.packages("RSelenium")

Step 5: Use RSelenium in R with the following code:


rm(list = ls())
options(stringsAsFactors = FALSE)
library(RSelenium)
# One can make the terminal invisible, but initially it helps in detecting any potential problem.
# Use the proper paths for Selenium and geckodriver.exe.
# Remember to rename the downloaded Selenium server jar if needed.
sel <- startServer(dir = "C:/Dev/Selenium/",
                   javaargs = c("-Dwebdriver.gecko.driver=\"C:/Dev/Selenium/geko/geckodriver.exe\""),
                   invisible = FALSE)
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4444,
                      browserName = "firefox",
                      extraCapabilities = list(marionette = TRUE))
remDr$open()
# test
remDr$navigate("https://www.google.com/")

The most important lines of the code are shown below:

# in startServer
javaargs = c("-Dwebdriver.gecko.driver=\"C:/Dev/Selenium/geko/geckodriver.exe\"")

# in remoteDriver
extraCapabilities = list(marionette = TRUE)
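Once the session is up, a quick interaction check can look like the sketch below [the CSS selector and search term are purely illustrative; findElement, sendKeysToElement and getTitle are standard RSelenium methods]:

# Type a query into the Google search box and submit it
webElem <- remDr$findElement(using = "css selector", value = "input[name='q']")
webElem$sendKeysToElement(list("RSelenium", key = "enter"))
# Read the title of the result page
remDr$getTitle()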

Here are a few references from where I gathered my knowledge:

  1. Stackoverflow.com question addressing actual solution for version compatibility
  2. RSelenium : Headless browsing
  3. Phantomjs & rvest [just intro]
  4. WebDriver <-> Marionette proxy

Extras

A few other aspects need to be considered while running RSelenium for the first time. Frequently used options, including the marionette option mentioned above, are listed below.


sel <- startServer(dir = "C:/Dev/Selenium/",
                   args = c("-port 4455"),
                   javaargs = c("-Dwebdriver.gecko.driver=\"C:/Dev/Selenium/geko/geckodriver.exe\""),
                   invisible = FALSE)
firefox_profile.me <- makeFirefoxProfile(list(
  marionette = TRUE,
  webdriver_accept_untrusted_certs = TRUE,  # for sites with expired certificates (sometimes required for internal sites)
  webdriver_assume_untrusted_issuer = TRUE, # for the same reason
  browser.download.dir = "C:/temp",         # download directory; however, it's not having any effect as of now
  network.proxy.socks = "<proxy ip>",       # for proxy settings, specify the proxy host IP
  network.proxy.socks_port = 3128L,         # proxy port; the trailing "L" (integer) is very important, otherwise the setting has no impact
  network.proxy.type = 1L))                 # 1 for manual and 2 for automatic configuration script; here too the "L" is important
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4455,
                      browserName = "firefox",
                      extraCapabilities = firefox_profile.me)

For other configurable options, check out the GitHub source of Selenium and, after opening it, search within the page (using the browser's in-page search) for “set_preference”.

Also check out RSelenium interactive docs at rdrr.io.

 

Sending mail from R — October 23, 2016

Sending mail from R

Sending mail from R via an SMTP server is a common requirement R programmers face. Early in my R career I used a simpler and far less secure way to do this (which did not even require a valid e-mail id), but since the security updates and enhanced versions released by Gmail and other common e-mail providers, that simpler method has stopped working.

Finally, after a long search, I was able to get a similar thing working again [as of the time I'm writing this, Oct 2016]. But this time an actual account is required for sending the mail (here I'm showing the case for Gmail).

This is, however, a combined reproduction of the following web resources [so the credit goes to the authors of these resources; I simply combined the two and made them presentable]:

Source 1 and Source 2

Here are the steps:-

Step 1: Create a Gmail account and go to the settings link  for “less secure apps” (click here)

Step 2: Turn it on

[Screenshot]

Step 3: Now you are all ready to do things inside R. Here is a sample code to get you started.


library(mailR)
send.mail(from = "<username of sender>@gmail.com",
          to = "<username of receiver>@gmail.com",
          subject = "Test Email",
          body = "PFA the desired document",
          html = TRUE,
          smtp = list(host.name = "smtp.gmail.com",
                      port = 465,
                      user.name = "<username of sender>@gmail.com",
                      passwd = "<password>",
                      ssl = TRUE),
          authenticate = TRUE,
          attach.files = "lib.R")

It works perfectly; here is a screenshot as proof.

[Screenshot]

Now you can explore the mailR package for further details.

However, if anyone is thinking of retrieving mails from Gmail servers, here is a hint to get you started. (click here)

I have copy-pasted working code from the above link, hiding the credentials [the credit goes entirely to the person who answered the question on Stackoverflow.com]. It is tested by me and works perfectly.


# credit goes to http://stackoverflow.com/questions/4241812/how-can-i-send-receive-smtp-pop3-email-using-r
library(rJython)
rJython <- rJython(modules = "poplib")
rJython$exec("import poplib")
rJython$exec("M = poplib.POP3_SSL('pop.gmail.com', 995)")
rJython$exec("M.user(\'<username>@gmail.com\')")
rJython$exec("M.pass_(\'<password>\')")
rJython$exec("numMessages = len(M.list()[1])")
numMessages <- rJython$get("numMessages")$getValue()
# grab message number one; loop here if you want more messages
rJython$exec("msg = M.retr(1)[1]")
emailContent <- rJython$get("msg")
# turn the message into a list
contentList <- as.list(emailContent)
# so we have an R list... of Java objects.
# To get a more native R list we have to yank the string from each Java item
messageToList <- function(contentList){
  outList <- list()
  for (i in 1:length(contentList)){
    outList[i] <- contentList[[i]]$toString()
  }
  outList
}
messageAsList <- messageToList(contentList)
messageAsList


Difference in Using = and <- in R — June 9, 2016

Difference in Using = and <- in R

Many users who are starting with R bring this question to the table: “Why isn't the conventional = used for assignment?”

Well, there could be several posts on this. But here is a quick demonstration of why assigning variables with the “=” operator is not always a good idea.

Run R CMD Rserve on the command line to start an Rserve instance.

Now, in the R console, try:


library(RSclient)
con <- RS.connect()
RS.eval(con, ls())
RS.eval(con, y = 10)  # this is going to give a problem: "=" is parsed as naming an argument of RS.eval(), not as an assignment
RS.eval(con, y <- 10) # this is going to work
RS.eval(con, y)       # 10
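The same distinction shows up without Rserve as well: inside any function call, “=” is parsed as an argument name rather than an assignment, while “<-” still assigns. A small illustration using only base functions:

system.time(a <- rnorm(1e6)) # works: "a" is created in the calling environment
system.time(a = rnorm(1e6))  # error: "a" is treated as an (unknown) argument of system.time()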

Batch Convert xls to xlsx : using R — May 11, 2016

Batch Convert xls to xlsx : using R

I had to read a bunch of Excel files with a similar structure and combine them into a single csv. I wanted to do that within R, on a Windows 10 machine with Office 2016 [Excel + others].

A few solutions exist for importing data from Excel files into R, but out of those I usually use gdata and openxlsx. I prefer openxlsx when it comes to sensitive data, as it reads the cell content instead of the visible values [as captured by gdata]. Now, the problem is that some of my files are xls and some are xlsx, and openxlsx can only read xlsx [an unfortunate limitation]. So I needed to convert all xls files to xlsx. This can be a fairly general requirement, so it deserves a general solution.

I came up with an immediate solution with the help of the RDCOMClient package.

Here is what I coded.


library(RDCOMClient)
convert_xls_to_xlsx <- function(in_folder, out_folder, delete_xls = FALSE){
  if(missing(out_folder)){
    out_folder <- in_folder
  }
  all_xls <- list.files(in_folder, pattern = ".xls$")
  if(length(all_xls) > 0){
    all_xls_out <- gsub(".xls$", ".xlsx", all_xls)
    try({
      xls <- COMCreate("Excel.Application")
      lapply(1:length(all_xls), function(i){
        cat(i, "\n")
        wb <- xls[["Workbooks"]]$Open(normalizePath(paste(in_folder, all_xls[i], sep = "\\")))
        wb$SaveAs(suppressWarnings(normalizePath(paste(out_folder, all_xls_out[i], sep = "\\"))), 51)
        wb$Close()
      })
      xls$Quit()
    }, silent = TRUE)
    if(delete_xls){
      all_xlsx_now <- list.files(in_folder, pattern = ".xlsx$")
      test <- setdiff(gsub(".xls$", "", all_xls), gsub(".xlsx$", "", all_xlsx_now))
      if(length(test) == 0){
        try(unlink(paste(in_folder, all_xls, sep = "\\")), silent = TRUE)
      }
    }
  }
  return(invisible(0))
}


# The main section of the code, responsible for the conversion, is given here:
xls <- COMCreate("Excel.Application")
wb <- xls[["Workbooks"]]$Open(normalizePath(paste(in_folder, all_xls[i], sep = "\\")))
wb$SaveAs(suppressWarnings(normalizePath(paste(out_folder, all_xls_out[i], sep = "\\"))), 51)
wb$Close()
xls$Quit()

How I got the exact number 51 [which corresponds to the xlsx format] for the SaveAs call will become clear if you visit this link and explore the object in the COM browser.

Here is what I did to find it out.

Step 1: I went to Excel and opened Visual Basic.

Excel Developer Options

Step 2: Open the Object Browser [by pressing F2].

Step 3: Search For Excel.XlFileFormat.xlOpenXMLWorkbook

VBA Object Browser
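To close the loop on the original requirement [reading all the converted files and combining them into one csv], here is a minimal sketch; it assumes the data sits on the first sheet of every file and that all files share the same column structure [the folder path is a placeholder]:

library(openxlsx)
in_folder <- "C:/temp/excel_files" # point this to the folder holding the converted xlsx files
all_xlsx <- list.files(in_folder, pattern = ".xlsx$", full.names = TRUE)
# Read every file's first sheet and stack the rows together
combined <- do.call(rbind, lapply(all_xlsx, read.xlsx, sheet = 1))
write.csv(combined, file.path(in_folder, "combined.csv"), row.names = FALSE)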
