Introduction

About Cyclistic

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

My role

I’m a junior data analyst working in the marketing analytics team at Cyclistics, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, The data analytics team want to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual members into annual members.
In order to answer the key business questions, I will follow the step of data analysis process.

Data Analysis Processes

Ask

Business Task Cyclistic Executive Team want to maximize their current customers base by converting riders of single_ride passes and full_day passes to riders of annual membership.
Stakeholders
Primary stakeholder- Cyclistic Executive Team
Secondary stakeholder- Lily Moreno- Director of Marketing

The three questions asked were:

  • How do annual members and casual riders use Cyclistic bikes differently?
  • Why would casual riders buy Cyclistic annual memberships?
  • How can Cyclistic use digital media to influence casual riders to become members?

I was assigned by the Cyclistic Marketing Analytics Team to answer the first question “How do annual members and casual riders use Cyclistic bikes differently?”. In my analysis, I need to articulate how annual and casual members use Cyclistic Bike-share differently and identify the trend of how both set of customers use bike to be able to provide contextual recommendations on converting casual members to annual members.

Scope of Work(SOW)

  • Who collected the data? Cyclistic Bike-share collects data from its customers. I am using Cyclistic historical data.
  • What is the data impact? The data gives insight of how Cyclistic riders in Chicago ride bike.
  • Where is the data origin? The data was taken from Cyclistic riders who live in Chicago and was stored on their database.
  • When was the data created and collected? For my analysis I needed the most recent data of a year, I collected data of 2021 October till 2022 September.
  • Why was this data set created? To track and analyse Cyclistic riders data to gain insights on how to foster Cyclistic growth and customer satisfaction.
  • How was the data created? The data was gotten from the bikes since all Cyclistic bikes are geotracked

Prepare

Process

library(dplyr)
library(lubridate)
library(stringr)
library(tibble)
library(readr)
library(tidyverse)
library(knitr)

Loading data

#To print the working directory file path
getwd()
## [1] "C:/Users/dell/Documents"
# to comfirm the existence of our file in the right path
file.exists("Excel/1st project/202109-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202110-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202111-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202112-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202201-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202202-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202203-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202204-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202205-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202206-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202207-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202208-divvy-tripdata.csv")
## [1] TRUE
#with the help of the readr package, reading a csv file using the read_csv command
df202109 <- read_csv("Excel/1st project/202109-divvy-tripdata.csv")
df202110 <- read_csv("Excel/1st project/202110-divvy-tripdata.csv")
df202111 <- read_csv("Excel/1st project/202111-divvy-tripdata.csv")
df202112 <- read_csv("Excel/1st project/202112-divvy-tripdata.csv")
df202201 <- read_csv("Excel/1st project/202201-divvy-tripdata.csv")
df202202 <- read_csv("Excel/1st project/202202-divvy-tripdata.csv")
df202203 <- read_csv("Excel/1st project/202203-divvy-tripdata.csv")
df202204 <- read_csv("Excel/1st project/202204-divvy-tripdata.csv")
df202205 <- read_csv("Excel/1st project/202205-divvy-tripdata.csv")
df202206 <- read_csv("Excel/1st project/202206-divvy-tripdata.csv")
df202207 <- read_csv("Excel/1st project/202207-divvy-tripdata.csv")
df202208 <- read_csv("Excel/1st project/202208-divvy-tripdata.csv")
  1. checking for spelling errors in the string values li>
  2. checking if any entry has duplicate values li>
  3. checking for null values li>
  4. computing the duration of ride by finding the difference of the start_at and end_at.

The data are in multiple files containing citibike data for each month, The data of 12 months were collected to be analysed from September 2021 till August 2022. All this month files were merged to be able to carry out a year analysis on citibike bike riders.

citibike_data <- rbind(df202109, df202110, df202111, df202112, df202201, df202202, 
                       df202203, df202204, df202205, df202206, df202207, df202208)

checking for spelling errors in the string values

# Checking if any entry of the rideabletype was spelt wrong
citibike_data %>%
  select(., rideable_type) %>%
  group_by(., rideable_type) %>%
  summarise()
## # A tibble: 3 × 1
##   rideable_type
##   <chr>        
## 1 classic_bike 
## 2 docked_bike  
## 3 electric_bike
# no spelling error as only 3 bike types exist

# Checking for if any entry of the member_casual was spelt wrong
citibike_data %>%
  select(., member_casual) %>%
  group_by(., member_casual) %>%
  summarise()
## # A tibble: 2 × 1
##   member_casual
##   <chr>        
## 1 casual       
## 2 member
# no spelling error as only 2 customer types exist

checking if any entry has duplicate values

citibike_data %>% 
  group_by(ride_id) %>% 
  summarise(count = n_distinct(ride_id)) %>% 
  filter(count > 1) 
## # A tibble: 0 × 2
## # … with 2 variables: ride_id <chr>, count <int>
# I tested if we have any duplicate values, the test came out negative. We grouped by ride_id since it is said to be with unique values

computing the duration of ride by finding the difference of the start_at and end_at

citibike <- citibike_data %>% select(., ride_id, started_at, ended_at, rideable_type, start_station_id, end_station_id, member_casual, start_lng, start_lat) %>%
  mutate(., duration = ended_at - started_at)
  1. cleaning the data set (formatting data types) li>
  2. cleaning the negative duration value

cleaning the data set (formatting data types)

summary(citibike)
##    ride_id            started_at                    
##  Length:5883043     Min.   :2021-09-01 00:00:06.00  
##  Class :character   1st Qu.:2021-11-06 13:47:33.00  
##  Mode  :character   Median :2022-05-07 12:30:47.00  
##                     Mean   :2022-03-22 05:41:41.57  
##                     3rd Qu.:2022-07-06 16:00:29.50  
##                     Max.   :2022-08-31 23:59:39.00  
##     ended_at                      rideable_type      start_station_id  
##  Min.   :2021-09-01 00:00:41.00   Length:5883043     Length:5883043    
##  1st Qu.:2021-11-06 14:07:58.50   Class :character   Class :character  
##  Median :2022-05-07 12:52:55.00   Mode  :character   Mode  :character  
##  Mean   :2022-03-22 06:01:26.78                                        
##  3rd Qu.:2022-07-06 16:18:47.00                                        
##  Max.   :2022-09-06 21:49:04.00                                        
##  end_station_id     member_casual        start_lng        start_lat    
##  Length:5883043     Length:5883043     Min.   :-87.84   Min.   :41.64  
##  Class :character   Class :character   1st Qu.:-87.66   1st Qu.:41.88  
##  Mode  :character   Mode  :character   Median :-87.64   Median :41.90  
##                                        Mean   :-87.65   Mean   :41.90  
##                                        3rd Qu.:-87.63   3rd Qu.:41.93  
##                                        Max.   :-73.80   Max.   :45.64  
##    duration       
##  Length:5883043   
##  Class :difftime  
##  Mode  :numeric   
##                   
##                   
## 
# The duration column data type 'datediff' can not be worked with so it has to be changed to a numeric type
citibike$duration <- as.numeric(citibike$duration)

summary(citibike)
##    ride_id            started_at                    
##  Length:5883043     Min.   :2021-09-01 00:00:06.00  
##  Class :character   1st Qu.:2021-11-06 13:47:33.00  
##  Mode  :character   Median :2022-05-07 12:30:47.00  
##                     Mean   :2022-03-22 05:41:41.57  
##                     3rd Qu.:2022-07-06 16:00:29.50  
##                     Max.   :2022-08-31 23:59:39.00  
##     ended_at                      rideable_type      start_station_id  
##  Min.   :2021-09-01 00:00:41.00   Length:5883043     Length:5883043    
##  1st Qu.:2021-11-06 14:07:58.50   Class :character   Class :character  
##  Median :2022-05-07 12:52:55.00   Mode  :character   Mode  :character  
##  Mean   :2022-03-22 06:01:26.78                                        
##  3rd Qu.:2022-07-06 16:18:47.00                                        
##  Max.   :2022-09-06 21:49:04.00                                        
##  end_station_id     member_casual        start_lng        start_lat    
##  Length:5883043     Length:5883043     Min.   :-87.84   Min.   :41.64  
##  Class :character   Class :character   1st Qu.:-87.66   1st Qu.:41.88  
##  Mode  :character   Mode  :character   Median :-87.64   Median :41.90  
##                                        Mean   :-87.65   Mean   :41.90  
##                                        3rd Qu.:-87.63   3rd Qu.:41.93  
##                                        Max.   :-73.80   Max.   :45.64  
##     duration      
##  Min.   :  -8245  
##  1st Qu.:    363  
##  Median :    643  
##  Mean   :   1185  
##  3rd Qu.:   1160  
##  Max.   :2442301
# some irregularities where found in the data as the min value is -8245 and max value is 2442301

cleaning the negative duration value For this process we try as much as possible not to loose any value in the dataset. Instead of deleting the negative durations we seperated it from the entire dataset and worked on it then merged it back with the whole dateset to get an wholesome dateset.

# viewing the duration of negative value
citibike %>%
  select(., started_at, ended_at, duration) %>%
  filter(., duration < 0)
## # A tibble: 135 × 3
##    started_at          ended_at            duration
##    <dttm>              <dttm>                 <dbl>
##  1 2021-09-29 17:04:38 2021-09-29 17:04:27      -11
##  2 2021-09-01 17:49:37 2021-09-01 17:49:31       -6
##  3 2021-09-29 16:53:34 2021-09-29 16:53:29       -5
##  4 2021-09-01 18:45:38 2021-09-01 18:45:24      -14
##  5 2021-09-29 18:42:50 2021-09-29 18:36:24     -386
##  6 2021-09-29 16:10:02 2021-09-29 16:09:59       -3
##  7 2021-09-29 15:40:52 2021-09-29 15:40:18      -34
##  8 2021-09-29 16:56:09 2021-09-29 16:53:46     -143
##  9 2021-09-29 14:02:52 2021-09-29 14:02:51       -1
## 10 2021-09-29 15:21:39 2021-09-29 15:21:09      -30
## # … with 125 more rows
# To clean the data, the negative value is separated from the positive value
citibike_pos <- citibike %>% 
  select(., ride_id, started_at, ended_at, duration,rideable_type,
         start_station_id, end_station_id, member_casual, start_lng, start_lat) %>%
  filter(., duration >= 0)

#
citibike_neg <- citibike %>%
  select(., ride_id, started_at = ended_at, ended_at = started_at, duration, rideable_type, start_station_id, end_station_id, member_casual, start_lng, start_lat) %>%
  filter(., duration < 0)

# Then it was joined together after cleaning
citibike_abs <- rbind(citibike_pos, citibike_neg) %>%
  select(., ride_id, started_at, ended_at, duration, rideable_type,
         start_station_id, end_station_id, member_casual, start_lng, start_lat) %>%
  mutate(., dur = ended_at - started_at) 
# The dur column data type 'datediff' can not be worked with so it has to be changed to a numeric type
citibike_abs$dur <- as.numeric(citibike_abs$dur)
# Checking if the problem as been resolved
citibike_abs %>% 
  filter(., dur < 0)
## # A tibble: 0 × 11
## # … with 11 variables: ride_id <chr>, started_at <dttm>, ended_at <dttm>,
## #   duration <dbl>, rideable_type <chr>, start_station_id <chr>,
## #   end_station_id <chr>, member_casual <chr>, start_lng <dbl>,
## #   start_lat <dbl>, dur <dbl>
# There is no negative duration

Exploratory Data Analysis

Note: This analysis was made necessary to check how casual and annual members use Citibike bike differently.
1. Checking for monthly trend in bike usage
2. Checking for weekday bike usage
3. Checking for hour trend in bike usage
4. Checking average ride duration of casual and annual members

Extracting month, weekday and hour from the started_at(datettime) column

# A column of  month was made and the month value was extracted from the started_at column
citibike_abs$month <- month(citibike_abs$started_at, label = TRUE)

# A column of weekday was made and the weekday value was extracted from the started_at column
citibike_abs$weekday <- wday(citibike_abs$started_at, label = TRUE)

# A column of  hour was made and the hour value was extracted from the started_at column
citibike_abs$hour <- hour(ymd_hms(citibike_abs$started_at))

citibike_analyze <- citibike_abs %>%
  select(., ride_id, started_at, rideable_type, dur, member_casual, month, weekday, hour, start_lng, start_lat)
citibike_analyze$dur <- as.numeric(citibike_analyze$dur)
# extracting the date out of the datetime format
citibike_analyze$date <- as.Date(citibike_analyze$started_at)

# adding the date column to the data
citibike_analyze <- citibike_analyze %>%
  select(., date, started_at ,rideable_type, dur, member_casual, month, weekday, hour)

Checking for monthly trend in bike usage

citibike_analyze %>%
  select(., member_casual, month) %>%
  group_by(., member_casual, month) %>%
  summarise(., month_count = n()) %>%
  arrange(., desc(month_count)) 
## # A tibble: 24 × 3
## # Groups:   member_casual [2]
##    member_casual month month_count
##    <chr>         <ord>       <int>
##  1 member        Aug        427008
##  2 member        Jul        417433
##  3 casual        Jul        406055
##  4 member        Jun        400153
##  5 member        Sep        392257
##  6 member        Oct        373984
##  7 casual        Jun        369051
##  8 casual        Sep        363890
##  9 casual        Aug        358924
## 10 member        May        354443
## # … with 14 more rows

Checking for weekday bike usage

# the amount of times casual and members frequently use bikes during the weeks
weekday_group <- citibike_analyze %>%
  select(., member_casual, weekday) %>%
  group_by(., member_casual, weekday) %>%
  summarise(., weekday_count = n())

Grouping number of rider by weekday

kable(weekday_group)
member_casual weekday weekday_count
casual Sun 437447
casual Mon 292184
casual Tue 278023
casual Wed 293245
casual Thu 311489
casual Fri 346027
casual Sat 510064
member Sun 404525
member Mon 474818
member Tue 536494
member Wed 547038
member Thu 525487
member Fri 472180
member Sat 454022

Plotting weekday_group

ggplot(weekday_group, aes(weekday, weekday_count, fill = member_casual)) +
  geom_col(position = "dodge") 

Checking for hour trend in bike usage

ride_per_hour <- citibike_analyze %>%
  select(., member_casual, hour) %>%
  group_by(., member_casual, hour) %>%
  summarise(., hour_count = n()) %>%
  arrange(., desc(hour_count))

Grouping number of rider by hour

kable(ride_per_hour)
member_casual hour hour_count
member 17 355081
member 16 293868
member 18 292754
casual 17 231580
member 15 223312
member 19 210495
casual 18 209786
member 8 205851
casual 16 205589
member 12 194427
member 13 191356
member 14 189171
casual 15 186811
member 7 175176
casual 14 169913
member 11 169171
casual 13 162081
casual 19 161806
casual 12 154543
member 9 148408
member 20 147948
member 10 141487
casual 11 131899
casual 20 119137
member 21 114694
casual 10 102084
casual 21 102055
casual 22 93429
member 6 92186
member 22 87831
casual 9 77676
casual 8 70789
casual 23 69670
member 23 58324
casual 7 52971
casual 0 49782
member 0 36511
member 5 33926
casual 1 32794
casual 6 29246
member 1 22700
casual 2 20766
casual 5 13219
member 2 12899
casual 3 12120
member 4 9096
casual 4 8733
member 3 7892

Plotting ride_per_hour

ggplot(ride_per_hour, aes(hour, hour_count, color = member_casual)) +
  geom_line()

Checking average ride duration of casual and annual members

ride_per_day <- citibike_analyze %>%
  select(.,date, member_casual, dur) %>%
  group_by(.,date, member_casual) %>%
  summarise(., avg_dur = mean(dur)) %>%
  arrange(., date)

Showing the trend of average bike riders duration per day

ggplot(ride_per_day, aes(date, avg_dur, color = member_casual)) +
  geom_line()

Share

write_csv(citibike_analyze, "citibike.csv")

The cleaned data was exported out for further visualization and analysis in Tableau see link

Act

  1. Members tend to use bike more during the week days to commute to work and the use bikes the least during weekends
  2. Casual members tends to use bike as opposed to the members more on weekends and the least on weekdays. This Trends shows that most people that have membership are people that often use the bike to commute to work hence, the large amount of people that use the bike during the weekdays. On the contrary, casual riders most likely use bikes for liesure and casual errands.
  3. Backing our earlier claim that a huge chunk of member use cyclistic to commute to work. Annual members use bikes to commute to work that why there is increase in bike usage from 7am to 9 am see(ride_per_hour plot).
  1. Since casual riders rides bike mostly during weekends I suggest a mouth watering weekend discount on weekends that will only be applicable to only annual members.
  2. There should be dailly goal of minute of rides a person ought to make to keep fit and healthy, this will increase the frequency at which individuals use bike thereby causing them to subscribe for annual membership.