Unlocking Financial Insights: How to Scrape and Analyze Google Pay Transactions with Python

Google Pay transactions are not easy to analyze out of the box. The app comes in different flavors in different countries, and its capabilities vary accordingly, so if you want to analyze your spending, you have to extract the data yourself.

In this post, we’ll explore how to automate scraping and analyzing your Google Pay activity using Python. By the end, you’ll be able to extract transaction data, categorize transactions, and save the results for further analysis.

Prerequisites

Before we begin, make sure you have the following prerequisites:

  • Basic knowledge of Python.
  • Familiarity with HTML.
  • Python libraries: BeautifulSoup and Pandas.

You can install these libraries using pip:

pip install beautifulsoup4 pandas

Step 1: Download Your Google Pay Activity

The first step is to download your Google Pay activity as an HTML file. Follow these steps:

  1. Open the Google Pay app on your device.
  2. Navigate to the “Settings” or “Activity” section.
  3. Look for the option to “Download transactions” or “Request activity report.”
  4. Choose the time frame for your report and download it as an HTML file.

Step 2: Parsing HTML with BeautifulSoup

We’ll use BeautifulSoup to parse the downloaded HTML content. Here’s how to do it:

from bs4 import BeautifulSoup

# Load the downloaded HTML file
with open('My Activity.html', 'r', encoding='utf-8') as file:
    html_content = file.read()

# Parse HTML content
soup = BeautifulSoup(html_content, 'html.parser')

Step 3: Extracting Transaction Data

The Google Pay activity HTML stores each transaction inside <div> elements. We’ll extract this data using BeautifulSoup: first we locate the outer cell by its class, then use a regular expression to pick out the transaction action — “Paid”, “Received” or “Sent”.

import re

# Find all outer-cell elements
outer_cells = soup.find_all('div', class_='outer-cell mdl-cell mdl-cell--12-col mdl-shadow--2dp')

action_pattern = r'(Paid|Received|Sent)'
actions = []

# Iterate through outer-cell elements and record the action (Paid, Received, Sent)
for outer_cell in outer_cells:
    # Find content-cell elements within each outer-cell
    content_cells = outer_cell.find_all('div', class_='content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1')
    action_match = re.search(action_pattern, content_cells[0].text)
    if action_match:
        actions.append(action_match.group(0))
    else:
        actions.append(None)
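Alongside the action, the same content cell typically carries the amount and the recipient, which we will want as columns later. The exact wording varies by country and app version, so the patterns below are assumptions you should adapt to your own file; a minimal sketch:

```python
import re

# Assumed cell text shape, e.g. "Paid ₹1,250.00 to Zomato using Bank Account"
# — adjust these patterns to whatever your own activity file contains.
amount_pattern = r'[₹$€]\s?([\d,]+\.?\d*)'
recipient_pattern = r'\bto\s+(.+?)\s+using\b'

def parse_cell(text):
    """Return (amount, recipient) from one content cell, or None where no match."""
    amount_match = re.search(amount_pattern, text)
    recipient_match = re.search(recipient_pattern, text)
    amount = amount_match.group(1).replace(',', '') if amount_match else None
    recipient = recipient_match.group(1) if recipient_match else None
    return amount, recipient
```

Called inside the same loop as above, this fills amounts and recipients lists in step with actions.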

Step 4: Handling Date and Time

Extracting the date and time from the Google Pay activity HTML can be challenging due to the format. We’ll use regular expressions to capture the date and time:

date_time_pattern = r'(\w{3} \d{1,2}, \d{4}, \d{1,2}:\d{2}:\d{2}[^\w])'
dates = []

# Inside the same loop over outer_cells as in Step 3
for outer_cell in outer_cells:
    content_cells = outer_cell.find_all('div', class_='content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1')
    date_time_match = re.search(date_time_pattern, content_cells[0].text)
    if date_time_match:
        dates.append(date_time_match.group(0).strip())
    else:
        dates.append(None)
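Once captured, the raw string can be converted into a real datetime so that month and year columns (used when assembling the data frame in Step 6) can be derived. A small sketch using the standard library; the sample timestamp is made up, and 12-hour stamps with AM/PM would need a different format string:

```python
from datetime import datetime

def parse_gpay_timestamp(raw):
    # Matches strings captured by the regex above, e.g. "Sep 5, 2023, 10:15:32".
    # 12-hour stamps with AM/PM would need '%b %d, %Y, %I:%M:%S %p' instead.
    return datetime.strptime(raw, '%b %d, %Y, %H:%M:%S')

dt = parse_gpay_timestamp('Sep 5, 2023, 10:15:32')
month_name, year = dt.strftime('%B'), dt.year
```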

Step 5: Categorizing Transactions

To categorize transactions, we’ll create a mapping of recipient names to categories. This helps consolidate expenses and analyze spending by category.

recipient_categories = {
    'Krishna Palamudhir and Maligai': 'Groceries',
    'FRESH DAIRY PRODUCTS INDIA LIMITED': 'Milk',
    'Zomato': 'Food',
    'REDBUS': 'Travel',
    'IRCTC Web UPI': 'Travel',
    'Bharti Airtel Limited': 'Internet & Telecommunications',
    'AMAZON SELLER SERVICES PRIVATE LIMITED': 'Cloud & SaaS',
    'SPOTIFY': 'Entertainment',
    'UYIR NEER': 'Pets',
    # Add more recipient-category mappings as needed
}

Step 6: Automating Categorization

Now, let’s automatically categorize transactions based on recipient names and prepare the data frame:

# Map recipients to categories (recipients missing from the mapping become NaN)
df['Category'] = df['Recipient'].map(recipient_categories)

# Reorder columns
df = df[['Action', 'Recipient', 'Category', 'Account Number', 'Amount', 'Date', 'Month', 'Year', 'Date and Time', 'Details']]
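The df used here is assumed to have been assembled from the lists collected in the earlier steps. A minimal sketch of that assembly, with a couple of illustrative rows standing in for the real extracted data:

```python
import pandas as pd

# Illustrative stand-ins for the lists built while looping over outer_cells
actions = ['Paid', 'Received']
recipients = ['Zomato', 'SPOTIFY']
amounts = [250.00, 119.00]
dates = ['Sep 5, 2023, 10:15:32', 'Sep 6, 2023, 08:01:10']

df = pd.DataFrame({
    'Action': actions,
    'Recipient': recipients,
    'Amount': amounts,
    'Date and Time': dates,
})

recipient_categories = {'Zomato': 'Food', 'SPOTIFY': 'Entertainment'}
df['Category'] = df['Recipient'].map(recipient_categories)
```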

Step 7: Saving Data to CSV

Finally, we’ll save the extracted and categorized data to a CSV file:

# Save the data to a CSV file
df.to_csv('google_pay_activity.csv', index=False, encoding='utf-8')

Now, you have your Google Pay activity data neatly organized in a CSV file, ready for analysis!

Outcome

You can see that the mapping has worked automatically and the CSV has been generated. A sample output is shared here for quick reference.

Conclusion

In this post, we learned how to automate the process of scraping and analyzing Google Pay activity using Python. By following these steps, you can easily keep track of your financial transactions and gain insights into your spending habits.

Feel free to share your comments and inputs.

Step by Step Sentiment analysis on Twitter data using R with Airtel Tweets: Part – III

After a lot of difficulties, here is my third post on this topic this weekend. In the first post we saw what sentiment analysis is and the steps involved in it. In the previous post we saw, step by step, how to retrieve tweets and store them in a file. Now we will move on to the sentiment analysis itself.

Goal: To do sentiment analysis on Airtel Customer support via Twitter in India.

In this post: We will load the tweets that were retrieved and stored in the previous post and start the analysis. I’m going to use the simple algorithm used by Jeffrey Breen to determine the scores/moods around a particular brand on Twitter.

We will use the opinion lexicon provided by him, which is primarily based on the papers of Hu and Liu. You can visit their site for a lot of useful information on sentiment analysis. We determine the positive and negative words in each tweet, and the scoring is based on them.

Step 1: We will import the CSV file into R using read.csv; you can use summary to display a summary of the data frame.

Step 2: We load the positive and negative word lists, store them locally, and import them using the scan function, as given below:


Step 3:

Now we will look at the code for evaluating the sentiment. It has been taken from http://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/. Thanks to Jeffrey for the source code.


Step 4:

We will test this sentiment.score function with some sample data.


In this step we have created a vector named test and added three sentences to it, containing a mix of words that may be positive or negative. Pass test to the score.sentiment function along with the pos_words and neg_words we loaded in the previous steps. You then get a score back from score.sentiment for each sentence.

We will also try to understand a little more about this function and what it does:

a. Two libraries are loaded: plyr and stringr, both written by Hadley Wickham, one of the great contributors to R. You can learn more about plyr using its page or tutorial, and get more insight into the split-apply-combine approach here, the best place to start according to Hadley Wickham. You can think of it as analogous to Google’s MapReduce, which is used more for parallelism. stringr makes string handling easier.

b. Next, laply is used. You can learn more about what the apply functions do here. In our case we pass the sentences vector to laply; in simple terms, it takes each tweet, passes it to the scoring function along with the positive and negative word lists, and combines the results.

c. Next, gsub handles the replacements, using the form gsub(pattern, replacement, x).

d. Then the sentence is converted to lowercase.

e. The sentences are split into words using the split functions, and the scores are computed by matching the words against the lexicons.
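For readers more comfortable in Python, the scoring idea boils down to: clean the text, split it into words, and subtract the negative-lexicon hits from the positive-lexicon hits. This is a Python re-sketch of the approach, not Breen’s original R code, and the tiny word lists are illustrative:

```python
import re

def score_sentiment(sentence, pos_words, neg_words):
    # Mirror the gsub/tolower steps: strip non-letters, lowercase, split into words
    words = re.sub(r'[^A-Za-z\s]', ' ', sentence).lower().split()
    # Score = positive matches minus negative matches
    return sum(w in pos_words for w in words) - sum(w in neg_words for w in words)

pos_words = {'good', 'great', 'love'}
neg_words = {'bad', 'awful', 'hate'}
score = score_sentiment('I love the great service, but the delays are awful', pos_words, neg_words)
```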

Step 5: Now we pass the Airtel tweets from airteltweetdata$text to the sentiment function to retrieve the scores.

Step 6: We will look at the summary of the scores and their histogram:

The histogram outcome:

It shows that most of the 1,499 responses about Airtel are negative.

Disclaimer: Please note that this sample data is analyzed purely for educational and learning purposes. It is not intended to target or influence any brand.

Text Mining: Google n-Gram Viewer & the word “Tamil”

What is n-Gram?

According to Wikipedia, the n-gram viewer is a “Phrase-usage graphing tool which charts the yearly count of selected n-grams (letter combinations) or words and phrases, as found in over 5.2 million books digitized by Google Inc (up to 2008).”

The Reference URL:

http://books.google.com/ngrams/

My Experiment:

http://books.google.com/ngrams/graph?content=Tamil&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=

I was curious about the following, to understand the presence of the word “Tamil” in the Google digitization project.

Figure 1: Courtesy Google n-Grams

Some inscriptions from the book date back to 1854, courtesy of the Google digitization project.

Figure 2: Book digitized by Google from Jaffna, Sri Lanka, Tamil to English

I was astonished by the way the digitization has been done and the way the text mining works. Awesome. Try your hands at it too.

Text Mining: Intro, Tools and References

What is it?

In simple terms, it is retrieving quality information from text for analysis.

Where can it be used?

  1. Analysis of emails, messages, etc.
  2. Analysis of open-ended surveys
  3. Analysis of claims for fraud detection
  4. Investigation by crawling
  5. Spam filtering
  6. Labeling for Machine learning
  7. Recommendations engine

Various Stages of Text Mining:

Good tools for Text Mining (free):

  • R Programming (refer to the tm package)
  • Gensim (Python library for analyzing plain text)
  • GATE (open-source text-processing framework, around 15 years old)

Good References:

Where to get started: http://tedunderwood.com/2012/08/14/where-to-start-with-text-mining/

http://www.statsoft.com/textbook/text-mining/

http://rapid-i.com/component/option,com_myblog/show,Great-Video-Series-about-Text-Mining.html/Itemid,172/

Analysis of Cricketer “Dhoni’s 200” tweets on Twitter using R

What an innings of 200 on Day 3 in Chennai yesterday! I loved it. The thought of exploring what people on Twitter think of his 200 is what triggered this blog post, but it required a lot of learning about using Twitter with R, which I have summed up below. Irrespective of the intent behind analyzing Dhoni’s 200, it also makes a lot of business sense to analyze trends in social media: to understand how social media is treating your brand or products, it’s important to analyze the data available on Twitter. I’m using R for a fundamental analysis of tweets, based on the twitteR package available for R.

  1. If you have not installed the twitteR package, install it with install.packages("twitteR")
  2. This will also install the necessary dependencies of that package (RCurl, bitops, rjson).
  3. Load the package using library(twitteR)

  4. In the above R console statements I tried to get up to 1,000 tweets, but I managed to get only 377. That’s the reason you see n=377; otherwise it returned the error “Error: Malformed response from server, was not JSON”
  5. If you don’t mention a value for n, it returns 25 records by default, which you can verify using length(dhoni200_tweets)
  6. Next we need to analyze the tweets, so install the text-mining package “tm”

  7. The next step is to feed the collected tweets into the text mining, but to do so we need to convert the tweets into a data frame. Use the following commands:

    > dhoni200_df=do.call("rbind", lapply(dhoni200_tweets, as.data.frame))

    > dim(dhoni200_df)

    [1] 377 10

    > dhoni200_df

  8. Next we need to move the text data into a Corpus as a VectorSource, using the command > dhoni200.corpus=Corpus(VectorSource(dhoni200_df$text))
  9. When you issue the command > dhoni200.corpus, you get the result “A corpus with 377 text documents”
  10. Next, refine the content by converting to lowercase, removing punctuation and unwanted words, and then convert it to a term-document matrix:

    > dhoni200.corpus=tm_map(dhoni200.corpus,tolower)

    > dhoni200.corpus=tm_map(dhoni200.corpus,removePunctuation)

    > mystopwords=c(stopwords('english'),'profile','prochoice')

    > dhoni200.corpus=tm_map(dhoni200.corpus,removeWords,mystopwords)

    > dhoni200.dtm=TermDocumentMatrix(dhoni200.corpus)

    > dhoni200.dtm

    A term-document matrix (783 terms, 377 documents)

    Non-/sparse entries: 3930/291261

    Sparsity : 99%

    Maximal term length: 23

    Weighting : term frequency (tf)

  11. Analysis: When we look for the words that occurred at least 30 and 50 times respectively, these were the results:

  12. Analysis: I further analyzed the words associated with the word “century”. The following were the results:

    The term “firstever” has the highest association, at 0.61. In the findAssocs command, the number 0.20 is the correlation threshold.

  13. The command names(dhoni200_df) lists the various columns produced when the tweets are converted to a data frame.

    [1] "text" "favorited" "replyToSN" "created" "truncated"

    [6] "replyToSID" "id" "replyToUID" "statusSource" "screenName"

  14. Analysis: the users with the most tweets:

    > counts=table(dhoni200_df$screenName)

    > barplot(counts)
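The tm pipeline used above (lowercase, strip punctuation, drop stop words, count terms) can also be sketched in plain Python with just the standard library; the tweets and the tiny stop-word list here are illustrative stand-ins:

```python
import re
from collections import Counter

stopwords = {'the', 'a', 'an', 'is', 'what', 'by'}  # tiny illustrative list

def term_frequencies(texts):
    counts = Counter()
    for text in texts:
        # lowercase + strip non-letters, like tolower and removePunctuation
        words = re.sub(r'[^a-z\s]', ' ', text.lower()).split()
        counts.update(w for w in words if w not in stopwords)
    return counts

tweets = ['What an innings by Dhoni!', 'Dhoni scores 200, what a century']
freq = term_frequencies(tweets)
# Terms appearing at least twice, akin to findFreqTerms in tm
frequent = [term for term, n in freq.items() if n >= 2]
```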