Efficient way to report countries moving towards service orientation

While technology grows exponentially, certain service businesses are declining. The purpose of this work is to develop a method that automatically reports countries leaning towards service orientation by capturing patterns in historical service data. A company can then pursue, invest in, and expand its service businesses in those countries.


The data consists of service percentages for 144 countries by year from 1995 to 2030. Values from 1995 to 2015 are observed data, while values from 2016 to 2030 are forecasts, produced by fitting a random forest model to information such as service percentage, GDP, and imports from 1995 onward.


To find out which countries have a growing service percentage, we compare the past pattern (observed data from 1995 to 2015) to the year 2020. We cannot simply compare the service percentage of 2015 to that of 2020, because 2015 may have a systematically low service percentage that makes the growth look large, when the historical pattern actually indicates only a steady increase. So we need a method that automatically reports countries with an increasing service percentage, taking the years around 2015 into account, instead of manually going through all 144 countries.

The method assigns a weight to each year’s service percentage: recent years (2015, 2014, 2013, …) receive higher weights, while older years (1995, 1996, 1997, …) receive lower weights. We then select countries whose ratio of the 2020 service percentage to the weighted average exceeds 2, indicating growth in service percentage.
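As a sketch, the weighting and selection step might look like the following in R. The linearly increasing weights, the threshold, and the toy numbers are assumptions for illustration; the post does not specify the exact weight function.

```r
# Flag a country as growing if its 2020 forecast exceeds `threshold` times its
# weighted historical average. Weights are an assumption: linearly increasing,
# so 2015 counts most and 1995 least.
flag_growth <- function(history, forecast_2020, threshold = 2) {
  years <- seq_along(history)      # 1 = 1995, ..., 21 = 2015
  w <- years / sum(years)          # normalized, linearly increasing weights
  weighted_avg <- sum(w * history)
  forecast_2020 / weighted_avg > threshold
}

# Toy country: steady rise from 10% to 20%, forecast to reach 40% by 2020
history <- seq(10, 20, length.out = 21)
flag_growth(history, forecast_2020 = 40)  # TRUE: the ratio is about 2.4
```

A country whose 2020 forecast is only in line with its weighted history (ratio below 2) would not be flagged.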

Service percentage time plot

Assigning weights and reporting countries with an increase in service percentage growth



Answering which Customer is most likely to Churn: Database Marketing

After building a model and predicting churn from new Cell2Cell customer data in my previous post, I’d like to present results and recommendations to best serve the company.


About 2% of Cell2Cell’s customers voluntarily churn to competitors’ services each month. Because the company does not yet have a tool that makes its retention strategies effective, it needs a model that identifies customers who are likely to churn, and their characteristics, so it can address the causes of churn, whether poor phone reception or a lack of appealing special offers, and keep customers loyal. This model identifies customers with a 75% higher chance of churning and correctly predicts 67% of new customers likely to churn between 31 and 60 days after the data is collected.

Characteristics of customer groups who are likely to churn

  1. Customers in service region A have about a 5% higher chance of churning than those in region B, who in turn have about a 1% higher churn probability than those in region C; region C has about a 5% higher churn probability than region D.
  2. Customers who made calls to the retention team have an almost 5% higher chance of churning than those who did not.
  3. Customer groups missing the percent change in minutes used over the previous 4 months are 5% more likely to churn.


  1. Examine and improve cell phone reception, first in region A, followed by regions B, C, and D in that order.
  2. Offer customers who call the retention team discounts on the next month’s phone service, varying by service issue.
  3. Further explore the causes of missing information in the percent change in minutes over the previous 4 months.
  4. Transcribe call center recordings and analyze commonly mentioned topics and concerns from customers to offer more appealing promotions by mail.

Data Cleaning and processing:

  1. The training and holdout samples were first combined, then split back after the cleaning process to fit the different models.
  2. The data contains several missing values. Missingness in the following variables is statistically significant for predicting churn, so missingness indicators were included as added predictors in the model: monthly revenue, monthly minutes, total recurring charge, director-assisted calls, overage minutes, roaming calls, percentage change in minutes, and percentage change in revenues. The missing values themselves were replaced by the median of each variable.
  3. Service Area contains too many levels. The variable was replaced by two derived variables, area.code and region.code, each with 4 levels after their infrequent levels were grouped together.

Fitting Model:

  1. The following models were fit to the training sample using ROC as the metric: a vanilla recursive partition model, a random forest model, and a boosted tree model.
  2. The best model was selected by comparing ROC among the best model of each type. The winner is the boosted tree, with an ROC of .683 on the training sample.
  3. Combining the strengths of all the models was not attempted: not only would it take longer, but the boosted tree, while already very tuning intensive, is more powerful than the random forest and powerful enough to predict customer churn accurately.
  4. The most important factors in the model are CurrentEquipmentDays, MonthsInService, MonthlyMinutes, PercChangeMinutes, PercChangeRevenues, and TotalRecurringCharge.
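For reference, the ROC figures above are areas under the ROC curve (AUC), which can be computed directly from predicted churn probabilities. A minimal base-R version using the rank (Mann-Whitney) formulation, with made-up labels and scores:

```r
# AUC via the Mann-Whitney U statistic: the probability that a randomly
# chosen churner is scored higher than a randomly chosen non-churner.
auc <- function(labels, scores) {
  r <- rank(scores)                       # ranks of predicted probabilities
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(0, 0, 1, 1, 0, 1)             # 1 = churned (toy data)
scores <- c(0.10, 0.40, 0.35, 0.80, 0.20, 0.70)
auc(labels, scores)                       # 8/9, about 0.89
```

An AUC of .5 is no better than chance, so the boosted tree’s .683 represents a real, if modest, ranking ability.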

Exploring Factors associated with churn:

  1. Important drivers of churn were explored using the varImp function in the caret package, which ranks the most important variables in the model.
  2. To account for lurking variables when comparing differences in churn between groups, we first fit a model predicting churn from everything except the variable of interest, then compare churn probabilities before and after adjustment to isolate the pure difference between groups. The variables explored this way are Region, MadeCallToRetentionTeam, and MissingPercentMin.
  3. Each region (a, b, c, d) shows a significant difference in churn probability before versus after taking lurking variables into account; the adjusted effects on churn probability are almost 50% larger.
  4. After taking lurking variables into account for MadeCallToRetentionTeam, the associated churn probability increases by about 100%. This gives the fairest comparison of churn probabilities.
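The adjustment idea above can be sketched with base R’s glm on simulated data. The variable names, effect sizes, and data here are all hypothetical, not Cell2Cell’s:

```r
set.seed(1)
n <- 5000
usage  <- rnorm(n)                          # lurking variable: heavier users both
called <- rbinom(n, 1, plogis(-1 + usage))  # call retention more often and
churn  <- rbinom(n, 1, plogis(-2 + 0.5 * called + usage))  # churn more often

# Naive comparison: raw churn rates by whether the customer called retention
raw_diff <- mean(churn[called == 1]) - mean(churn[called == 0])

# Adjusted comparison: include the lurking variable in the model, then compare
# predicted churn probabilities for the two groups at the same usage level
fit <- glm(churn ~ called + usage, family = binomial)
adj <- predict(fit, data.frame(called = c(0, 1), usage = 0), type = "response")
adj_diff <- adj[2] - adj[1]

c(raw = raw_diff, adjusted = adj_diff)
```

In this simulation the raw gap overstates the effect of calling retention, because heavy usage drives both the call and the churn; the adjusted comparison isolates the call’s own contribution.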


Predicting Cellular Telephone Customer Churn Data


Database marketing: managing churn to maximize profits. Predicting cellular telephone customer churn. The data for this work comes from the Teradata Center for Customer Relationship Management at the Fuqua School of Business, Duke University.

Overview of cellular telephone industry

I had a chance to build models to predict customer churn from cellular telephone customer data, but before we get started, here is some background on the industry. Currently, investors are doubtful about wireless returns because the S&P Wireless/Cellular index has dropped more than the S&P 500. And because industry growth has declined dramatically, all carriers stay competitive by expanding their service and offering cut-throat deals to attract more subscribers.

Overview of Cell2Cell company

The company is the 6th largest in the US, with coverage in nearly all 50 states. Its strengths are its network infrastructure, expanding service areas, and marketing. Its stores are known to be ubiquitous.

Like other cellular telephone companies, the company has drawn doubts from investors about its stability; in addition, customer acquisition has decreased while customer churn has increased.

Their monthly churn rate is 4%, half of which is involuntary (e.g., when customers do not pay their bills). However, our goal is to focus proactively on customers who churn voluntarily. Customers churn because many carriers in the market offer similar deals and cheaper handsets. Furthermore, now that customers can change carriers without changing their phone numbers, the churn rate has exploded.

Marketing Overview

In marketing, the reactive retention program is common in the industry. It aims to convince churning customers to stay with the company by offering better deals when they call. A proactive retention program, however, allows the company to retain customers with a better approach. The program is a form of predictive CRM (customer relationship management): the company uses its churn model to identify specific customer groups who are likely to churn and to acknowledge their loyalty before they call to cancel their subscription. Because resources are limited, our target is customers who are 75% more likely to churn than the average customer.

Churn model

Our churn model predicts the churn probability of current customers from past customer data. The purpose of the model is to identify the factors that influence customers to churn and the customers who are likely to churn. To start off, we’ll import the data, clean it, and handle missing values in the variables. All the code and the data can be found on GitHub here.

Preprocessing Data

Because missingness is an important predictor of churn, we’ll create a missingness indicator column for each affected variable and replace the missing values with that variable’s median.
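A minimal sketch of that step; the data frame and column names below are hypothetical stand-ins for the Cell2Cell variables:

```r
# For each column: add a 0/1 missingness indicator, then fill NAs with the median.
impute_with_flag <- function(df, cols) {
  for (col in cols) {
    miss <- is.na(df[[col]])
    df[[paste0("Missing", col)]] <- as.integer(miss)   # new missingness predictor
    df[[col]][miss] <- median(df[[col]], na.rm = TRUE) # median imputation
  }
  df
}

train <- data.frame(MonthlyRevenue = c(30, NA, 50),
                    MonthlyMinutes = c(NA, 200, 400))
train <- impute_with_flag(train, c("MonthlyRevenue", "MonthlyMinutes"))
train$MonthlyRevenue         # 30 40 50 (NA replaced by the median)
train$MissingMonthlyRevenue  # 0 1 0
```

Keeping the indicator lets the model learn from the fact that a value was missing, separately from the imputed value itself.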

After further investigating all the variables in the data, some have too many levels, so their rare levels need to be combined. The following are functions to handle infrequent levels in a variable.

Combining rare levels in variables and replacing the train and holdout sets with imputed values.
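One way to write such a function, sketched in base R; the count threshold and the example levels are assumptions:

```r
# Replace factor levels that occur fewer than `min_count` times with "Other".
lump_rare <- function(x, min_count = 2) {
  counts <- table(x)
  rare <- names(counts)[counts < min_count]
  x <- as.character(x)
  x[x %in% rare] <- "Other"
  factor(x)
}

area <- factor(c("NYC", "NYC", "LA", "LA", "BOS", "SEA"))
levels(lump_rare(area))  # "LA" "NYC" "Other"
```

Grouping infrequent levels this way keeps the factor small enough for tree-based models without discarding the rare observations.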

Fitting models

After we have our data nice and clean, the next step is to fit a model predicting customer churn.

We’ll use parallel processing to fit and compare results among 3 different models:

  1. Vanilla partition model
  2. Random forest model
  3. Boosted tree model

Fitting recursive partition model

The best recursive partition model has a complexity parameter of .0005298317 and an ROC of .6266175.

Fitting random forest model

The random forest model yields an ROC of .6640235, which is better than the recursive partition model’s.

Fitting boosted tree model

The boosted tree model’s ROC is .6842619, which is higher than the random forest model’s.

Selecting the best model

The best model is the boosted tree model with parameters 2000; 0.01; 4; 10, which has an ROC of 0.6842619.

Next, we predict on the holdout data to evaluate the model; it predicts with 72% accuracy, which is good.

After we have our best model, the next step is to further explore factors that influence customer’s churn.

The top variables that most affect customer churn:

  1. Number of days the customer has used the current equipment
  2. Total number of months the customer has subscribed
  3. Average monthly minutes over the previous 4 months

All the results, with more details answering which customers are likely to churn, can be found here.

Exploring car tweets and interacting with the data

After collecting tweets about each car brand, including Tesla, Porsche, Audi, BMW, and Mercedes-Benz, in the previous post, in this post I’d like to visualize the data to answer our objective questions, letting the data communicate and tell us stories.

In this work, the goal is to analyze and answer the following questions:

  1. What are some major subjects and issues that people tweet about for each brand?
  2. What are some current trends in the automobile industry among these 5 car brands?
  3. What characteristics and features are people interested in across different areas of the automobile market (e.g., electric SUVs and environmental impact)?
  4. What topics do people tweet about when paying attention to specific characteristics (e.g., self-driving, driving experience)?
  5. What are customers’ attitudes towards particular topics, features, and products (e.g., positive, negative)?

My approach, instead of using fixed charts generated with Tableau, Excel, or R, is to use the popular data visualization framework Shiny. Shiny builds interactive web applications in R, empowering users to explore data and make decisions without any knowledge of HTML, CSS, or JavaScript. The app lets users compare and interact with the data and analysis, and tweak and define variable values in the charts. This makes visualizing and communicating data with two or more variables much more fun, powerful, sophisticated, and intelligible than fixed charts that tell limited stories. All the code behind the app can be found here on GitHub.

Follow this link to interact with the data yourself on Shiny App: https://janebunr.shinyapps.io/shinnycars/


Sample Interactive Shiny Interface


Using Twitter to learn what people think about major automobile companies

Supercars are fast, exciting, exotic, and irresistible. One of the topics that I’m enthusiastic about is cars, and this is the reason today I decided to explore different topics and trends that people are currently talking about among major supercar automakers.

The purpose of this work is to visualize and gain insightful information from raw qualitative data from Twitter, an open data source, about customers’ brand attitudes among five major automobile companies:

  1. Tesla
  2. Porsche AG
  3. Audi AG
  4. BMW
  5. Mercedes-Benz




The major focus is on the factors that play a major role in influencing customers’ attention and brand attractiveness, including current brand perception and product attributes, from car features to individual driving experiences.


Social media and its popularity have indeed changed our daily lives and buying decisions in recent years. Social sites not only enhance our experience, but also allow us to be heard and satisfy our curiosity, because we crave information. People complain, compliment, and capture their current moments, from tweeting about how upset they are when their favorite sports teams lose to sharing their ordinary moments. Because this data is raw and first-hand, it lets me reach individuals and see the innumerable tweets people post about these five major car brands. My goal is to look into customers’ attitudes towards each brand, including the attractiveness, innovativeness, and expensiveness of the products, and hopefully their purchase intentions.

My approach is to crawl 150,000 tweets in total (30,000 per car brand) from Twitter using the Twitter API, then analyze them with text analysis and topic models to visualize insightful information using R. The book I refer to is Text Mining with R: A Tidy Approach by Julia Silge and David Robinson, and you can find all the code on GitHub here.


Getting data from Twitter

First, I crawl tweets about the car brands of interest from the Twitter API using the keywords “a tesla”, “Porsche”, “BMW”, “Audi”, and “Mercedes”, collecting 10,000 tweets per brand per day on 3 different days (a total of 30,000 tweets per brand, or 150,000 overall), and store each batch as a table in a local database.


api_key = "[api key]"
api_secret = "[api secret]"
access_token = "[access_token]"
access_token_secret = "[access token secret]"
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret) # authenticate with the Twitter API
# a database backend must be registered first, e.g. register_sqlite_backend("tweets.db")

TESLA <- searchTwitter("a tesla", n=10000, lang="en", retryOnRateLimit = 500); store_tweets_db(TESLA, table_name="Tesla")
PORSCHE <- searchTwitter("porsche", n=10000, lang="en", retryOnRateLimit = 120); store_tweets_db(PORSCHE,table_name="porsche")

BMW <- searchTwitter("BMW", n=10000, lang="en", retryOnRateLimit = 120); store_tweets_db(BMW,table_name="BMW")

AUDI <- searchTwitter("audi", n=10000, lang="en",retryOnRateLimit = 120); store_tweets_db(AUDI,table_name="Audi")

MERCEDES <- searchTwitter("mercedes", n=10000, lang="en",retryOnRateLimit = 120); store_tweets_db(MERCEDES,table_name="Mercedes")

The next step is to load the data from the database and handle emoticons, which could cause some disasters if not removed before analyzing the data. You can download all the data I crawled on GitHub here.

Tesla = load_tweets_db(table_name = "Tesla")
Tesla <- do.call("rbind",lapply(Tesla, as.data.frame))
Tesla$text <- iconv(Tesla$text, "ASCII", "UTF-8", sub="")

Porsche = load_tweets_db(table_name ="porsche")
Porsche <- do.call("rbind",lapply(Porsche, as.data.frame))
Porsche$text <- iconv(Porsche$text, "ASCII", "UTF-8", sub="")

bmw = load_tweets_db(table_name = "BMW")
bmw <- do.call("rbind",lapply(bmw, as.data.frame))
bmw$text <- iconv(bmw$text, "ASCII", "UTF-8", sub="")

Audi = load_tweets_db(table_name = "Audi")
Audi <- do.call("rbind",lapply(Audi, as.data.frame))
Audi$text <- iconv(Audi$text, "ASCII", "UTF-8", sub="")

Mercedes = load_tweets_db(table_name = "Mercedes")
Mercedes <- do.call("rbind",lapply(Mercedes, as.data.frame))
Mercedes$text <- iconv(Mercedes$text, "ASCII", "UTF-8", sub="")

Cleaning tweet data

Before analyzing the tweets, since we want to focus on valuable words that serve our goals, we need to remove text and symbols that add no value so that we can analyze the data more effectively.

combined_tweets <- c(Tesla$text,Tesla2$text,Tesla3$text,Porsche$text,Porsche2$text,Porsche3$text, bmw$text, bmw2$text,bmw3$text, Audi$text, Audi2$text,Audi3$text, Mercedes$text, Mercedes2$text,Mercedes3$text)

combined_tweets <- rm_url(combined_tweets, pattern=pastex("@rm_twitter_url", "@rm_url")) # removing urls (qdapRegex)

text <- lapply(combined_tweets, function(x) {
 x = gsub('http\\S+\\s*', '', x) ## Remove URLs
 x = gsub('\\b+RT', '', x) ## Remove RT
 x = gsub('#\\S+', '', x) ## Remove hashtags
 x = gsub('@\\S+', '', x) ## Remove mentions
 x = gsub('[[:cntrl:]]', '', x) ## Remove control and special characters
 x = gsub("\\d", '', x) ## Remove digits
 x = gsub('[[:punct:]]', '', x) ## Remove punctuation
 x = gsub("^[[:space:]]*", "", x) ## Remove leading whitespace
 x = gsub("[[:space:]]*$", "", x) ## Remove trailing whitespace
 x = gsub(' +', ' ', x) ## Collapse extra whitespace
 x
})


CAR_TWEETS <- data_frame(line=1:nrow(data.frame(combined_tweets)), text = as.character(text))
brand_list <- data_frame(CarBrand = rep(c("Tesla", "Porsche" ,"BMW" ,"Audi", "Mercedes-Benz"), each=30000),line=1:nrow(data.frame(combined_tweets)))
CAR_TWEETS <- CAR_TWEETS %>% left_join(brand_list)

Exploring the data


The following wordcloud shows the most common words across all tweets about the 5 car brands mentioned:

tidy_text <- CAR_TWEETS %>% unnest_tokens(word,text)

tidy_text %>%
 anti_join(stop_words) %>%
 count(word) %>%
 with(wordcloud(word, n, max.words = 150))

150 most common words in the data

The wordcloud above shows several common words in the data, suggesting possible popular topics mentioned for each car brand, including budget, coupe, flying, google, and economy.

Lists of most common positive and negative words

bing_word_counts <- tidy_text %>%
 inner_join(get_sentiments("bing")) %>%
 count(word, sentiment, sort = TRUE)

bing_word_counts %>%
 group_by(sentiment) %>%
 top_n(15) %>%
 ungroup() %>%
 mutate(word = reorder(word, n)) %>%
 ggplot(aes(word, n, fill = sentiment)) +
 geom_col(show.legend = FALSE) +
 facet_wrap(~sentiment, scales = "free_y") +
 labs(y = "Contribution to sentiment",
 x = NULL) +
 coord_flip()

Wordcloud of most common positive and negative words

tidy_text %>%
 inner_join(get_sentiments("bing")) %>%
 count(word, sentiment, sort = TRUE) %>%
 acast(word ~ sentiment, value.var = "n", fill = 0) %>%
 comparison.cloud(colors = c("red2", "blue3"),
 max.words = 100)



The data suggests the following top words that people have strong positive connections related to the brands: luxury, grand, classic, sexy, great, beautiful, like, love, win, dynamic, impressive, rich, smart, ready, work, top, perfect, powerful, stunning, hot, stylish, stable, fastest, pretty, fun, best

While the following words lead to negative connections: suspicious, stormy, smash, weird, rumors, problem, dead, bad, threat, steal, fails, disappoint, lack, stolen, broke, exhaust, hurt, expensive, scare, suspect




Most positive and negative words in the data


Companies’ Future action

A company can create car advertisements that build strong emotional connections with viewers and drive consumers to purchase, portraying the brand and the car as the best: luxurious, grand, classic, sexy, loved, liked, beautiful, winning, dynamic, impressive, rich, smart, ready, perfect, powerful, stunning, hot, stylish, stable, fastest, pretty, and fun.

At the same time, a company can act within the organization to restore its brand, making announcements and creating advertisements to reduce and avoid negative connotations towards the brand or any product related to: suspicious, stormy, smash, weird, dead, bad, threat, steal, fails, lack, stolen, broke, exhaust, hurt, expensive, scare, suspect.



New Year New Goals?

I finally made time to cross out some of the goals from last year and added some new goals for this year.



  1. Learn how to surf
  2. Go ice skating
  3. Go mountain biking
  4. Go skiing
  5. Go camping
  6. Go fishing
  7. Plan with some friends to go see the Olympics in Japan
  8. Hiking in Seattle
  9. Go backpacking. Take a trip. Get adventurous. Meet new people
  10. Learn how to fight
  11. Learn Bollywood dance (I am obsessed with Indian culture)
  12. Cook for friends



  1. More caring and empathetic to those who are most important to me
  2. A business owner
  3. An amazing cook
  4. The best at what I do. Know my niche, my audience and serve them better than anyone else.
  5. Someone who is not afraid to ask for more or something unique
  6. Someone who uses sophisticated words in normal conversations



  1. Skills to build deeper connections with friends and family
  2. Skills and tools to cook fusion food
  3. Emotional intelligence



  1. Read a book and meditate before sleep
  2. “Just ask” for more
  3. Dedicate time to write a blog
  4. Be physically and mentally prepared before each CrossFit class
  5. Keep up with text analysis and marketing analytics
  6. Apply useful advice from connecting with different alumni
  7. Give up the short-term mindset
  8. Stay hydrated
  9. Be stricter about my structured schedule


But what’s more important after setting these goals is not only to follow them, but also to note the reasons why you fail to follow your plans, so you can keep improving your strategies for achieving the goals.