In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
I’m a junior data analyst working in the marketing analytics team at Cyclistics, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, The data analytics team want to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual members into annual members. In order to answer the key business questions, I will follow the step of data analysis process.
Business Task Cyclistic Executive Team want to maximize their current customers base by converting riders of single_ride passes and full_day passes to riders of annual membership. Stakeholders Primary stakeholder- Cyclistic Executive Team Secondary stakeholder- Lily Moreno- Director of Marketing
I was assigned by the Cyclistic Marketing Analytics Team to answer the first question “How do annual members and casual riders use Cyclistic bikes differently?”. In my analysis, I need to articulate how annual and casual members use Cyclistic Bike-share differently and identify the trend of how both set of customers use bike to be able to provide contextual recommendations on converting casual members to annual members.
library(dplyr)
library(lubridate)
library(stringr)
library(tibble)
library(readr)
library(tidyverse)
library(knitr)
Loading data
#To print the working directory file path
getwd()
## [1] "C:/Users/dell/Documents"
# to comfirm the existence of our file in the right path
file.exists("Excel/1st project/202109-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202110-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202111-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202112-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202201-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202202-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202203-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202204-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202205-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202206-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202207-divvy-tripdata.csv")
## [1] TRUE
file.exists("Excel/1st project/202208-divvy-tripdata.csv")
## [1] TRUE
#with the help of the readr package, reading a csv file using the read_csv command
df202109 <- read_csv("Excel/1st project/202109-divvy-tripdata.csv")
df202110 <- read_csv("Excel/1st project/202110-divvy-tripdata.csv")
df202111 <- read_csv("Excel/1st project/202111-divvy-tripdata.csv")
df202112 <- read_csv("Excel/1st project/202112-divvy-tripdata.csv")
df202201 <- read_csv("Excel/1st project/202201-divvy-tripdata.csv")
df202202 <- read_csv("Excel/1st project/202202-divvy-tripdata.csv")
df202203 <- read_csv("Excel/1st project/202203-divvy-tripdata.csv")
df202204 <- read_csv("Excel/1st project/202204-divvy-tripdata.csv")
df202205 <- read_csv("Excel/1st project/202205-divvy-tripdata.csv")
df202206 <- read_csv("Excel/1st project/202206-divvy-tripdata.csv")
df202207 <- read_csv("Excel/1st project/202207-divvy-tripdata.csv")
df202208 <- read_csv("Excel/1st project/202208-divvy-tripdata.csv")
The data are in multiple files containing citibike data for each month, The data of 12 months were collected to be analysed from September 2021 till August 2022. All this month files were merged to be able to carry out a year analysis on citibike bike riders.
citibike_data <- rbind(df202109, df202110, df202111, df202112, df202201, df202202,
df202203, df202204, df202205, df202206, df202207, df202208)
checking for spelling errors in the string values
# Checking if any entry of the rideabletype was spelt wrong
citibike_data %>%
select(., rideable_type) %>%
group_by(., rideable_type) %>%
summarise()
## # A tibble: 3 × 1
## rideable_type
## <chr>
## 1 classic_bike
## 2 docked_bike
## 3 electric_bike
# no spelling error as only 3 bike types exist
# Checking for if any entry of the member_casual was spelt wrong
citibike_data %>%
select(., member_casual) %>%
group_by(., member_casual) %>%
summarise()
## # A tibble: 2 × 1
## member_casual
## <chr>
## 1 casual
## 2 member
# no spelling error as only 2 customer types exist
checking if any entry has duplicate values
citibike_data %>%
group_by(ride_id) %>%
summarise(count = n_distinct(ride_id)) %>%
filter(count > 1)
## # A tibble: 0 × 2
## # … with 2 variables: ride_id <chr>, count <int>
# I tested if we have any duplicate values, the test came out negative. We grouped by ride_id since it is said to be with unique values
computing the duration of ride by finding the difference of the start_at and end_at
citibike <- citibike_data %>% select(., ride_id, started_at, ended_at, rideable_type, start_station_id, end_station_id, member_casual, start_lng, start_lat) %>%
mutate(., duration = ended_at - started_at)
cleaning the data set (formatting data types)
summary(citibike)
## ride_id started_at
## Length:5883043 Min. :2021-09-01 00:00:06.00
## Class :character 1st Qu.:2021-11-06 13:47:33.00
## Mode :character Median :2022-05-07 12:30:47.00
## Mean :2022-03-22 05:41:41.57
## 3rd Qu.:2022-07-06 16:00:29.50
## Max. :2022-08-31 23:59:39.00
## ended_at rideable_type start_station_id
## Min. :2021-09-01 00:00:41.00 Length:5883043 Length:5883043
## 1st Qu.:2021-11-06 14:07:58.50 Class :character Class :character
## Median :2022-05-07 12:52:55.00 Mode :character Mode :character
## Mean :2022-03-22 06:01:26.78
## 3rd Qu.:2022-07-06 16:18:47.00
## Max. :2022-09-06 21:49:04.00
## end_station_id member_casual start_lng start_lat
## Length:5883043 Length:5883043 Min. :-87.84 Min. :41.64
## Class :character Class :character 1st Qu.:-87.66 1st Qu.:41.88
## Mode :character Mode :character Median :-87.64 Median :41.90
## Mean :-87.65 Mean :41.90
## 3rd Qu.:-87.63 3rd Qu.:41.93
## Max. :-73.80 Max. :45.64
## duration
## Length:5883043
## Class :difftime
## Mode :numeric
##
##
##
# The duration column data type 'datediff' can not be worked with so it has to be changed to a numeric type
citibike$duration <- as.numeric(citibike$duration)
summary(citibike)
## ride_id started_at
## Length:5883043 Min. :2021-09-01 00:00:06.00
## Class :character 1st Qu.:2021-11-06 13:47:33.00
## Mode :character Median :2022-05-07 12:30:47.00
## Mean :2022-03-22 05:41:41.57
## 3rd Qu.:2022-07-06 16:00:29.50
## Max. :2022-08-31 23:59:39.00
## ended_at rideable_type start_station_id
## Min. :2021-09-01 00:00:41.00 Length:5883043 Length:5883043
## 1st Qu.:2021-11-06 14:07:58.50 Class :character Class :character
## Median :2022-05-07 12:52:55.00 Mode :character Mode :character
## Mean :2022-03-22 06:01:26.78
## 3rd Qu.:2022-07-06 16:18:47.00
## Max. :2022-09-06 21:49:04.00
## end_station_id member_casual start_lng start_lat
## Length:5883043 Length:5883043 Min. :-87.84 Min. :41.64
## Class :character Class :character 1st Qu.:-87.66 1st Qu.:41.88
## Mode :character Mode :character Median :-87.64 Median :41.90
## Mean :-87.65 Mean :41.90
## 3rd Qu.:-87.63 3rd Qu.:41.93
## Max. :-73.80 Max. :45.64
## duration
## Min. : -8245
## 1st Qu.: 363
## Median : 643
## Mean : 1185
## 3rd Qu.: 1160
## Max. :2442301
# some irregularities where found in the data as the min value is -8245 and max value is 2442301
cleaning the negative duration value For this process we try as much as possible not to loose any value in the dataset. Instead of deleting the negative durations we seperated it from the entire dataset and worked on it then merged it back with the whole dateset to get an wholesome dateset.
# viewing the duration of negative value
citibike %>%
select(., started_at, ended_at, duration) %>%
filter(., duration < 0)
## # A tibble: 135 × 3
## started_at ended_at duration
## <dttm> <dttm> <dbl>
## 1 2021-09-29 17:04:38 2021-09-29 17:04:27 -11
## 2 2021-09-01 17:49:37 2021-09-01 17:49:31 -6
## 3 2021-09-29 16:53:34 2021-09-29 16:53:29 -5
## 4 2021-09-01 18:45:38 2021-09-01 18:45:24 -14
## 5 2021-09-29 18:42:50 2021-09-29 18:36:24 -386
## 6 2021-09-29 16:10:02 2021-09-29 16:09:59 -3
## 7 2021-09-29 15:40:52 2021-09-29 15:40:18 -34
## 8 2021-09-29 16:56:09 2021-09-29 16:53:46 -143
## 9 2021-09-29 14:02:52 2021-09-29 14:02:51 -1
## 10 2021-09-29 15:21:39 2021-09-29 15:21:09 -30
## # … with 125 more rows
# To clean the data, the negative value is separated from the positive value
citibike_pos <- citibike %>%
select(., ride_id, started_at, ended_at, duration,rideable_type,
start_station_id, end_station_id, member_casual, start_lng, start_lat) %>%
filter(., duration >= 0)
#
citibike_neg <- citibike %>%
select(., ride_id, started_at = ended_at, ended_at = started_at, duration, rideable_type, start_station_id, end_station_id, member_casual, start_lng, start_lat) %>%
filter(., duration < 0)
# Then it was joined together after cleaning
citibike_abs <- rbind(citibike_pos, citibike_neg) %>%
select(., ride_id, started_at, ended_at, duration, rideable_type,
start_station_id, end_station_id, member_casual, start_lng, start_lat) %>%
mutate(., dur = ended_at - started_at)
# The dur column data type 'datediff' can not be worked with so it has to be changed to a numeric type
citibike_abs$dur <- as.numeric(citibike_abs$dur)
# Checking if the problem as been resolved
citibike_abs %>%
filter(., dur < 0)
## # A tibble: 0 × 11
## # … with 11 variables: ride_id <chr>, started_at <dttm>, ended_at <dttm>,
## # duration <dbl>, rideable_type <chr>, start_station_id <chr>,
## # end_station_id <chr>, member_casual <chr>, start_lng <dbl>,
## # start_lat <dbl>, dur <dbl>
# There is no negative duration
Note: This analysis was made necessary to check how casual and annual members use Citibike bike differently. 1. Checking for monthly trend in bike usage 2. Checking for weekday bike usage 3. Checking for hour trend in bike usage 4. Checking average ride duration of casual and annual members
Extracting month, weekday and hour from the started_at(datettime) column
# A column of month was made and the month value was extracted from the started_at column
citibike_abs$month <- month(citibike_abs$started_at, label = TRUE)
# A column of weekday was made and the weekday value was extracted from the started_at column
citibike_abs$weekday <- wday(citibike_abs$started_at, label = TRUE)
# A column of hour was made and the hour value was extracted from the started_at column
citibike_abs$hour <- hour(ymd_hms(citibike_abs$started_at))
citibike_analyze <- citibike_abs %>%
select(., ride_id, started_at, rideable_type, dur, member_casual, month, weekday, hour, start_lng, start_lat)
citibike_analyze$dur <- as.numeric(citibike_analyze$dur)
# extracting the date out of the datetime format
citibike_analyze$date <- as.Date(citibike_analyze$started_at)
# adding the date column to the data
citibike_analyze <- citibike_analyze %>%
select(., date, started_at ,rideable_type, dur, member_casual, month, weekday, hour)
Checking for monthly trend in bike usage
citibike_analyze %>%
select(., member_casual, month) %>%
group_by(., member_casual, month) %>%
summarise(., month_count = n()) %>%
arrange(., desc(month_count))
## # A tibble: 24 × 3
## # Groups: member_casual [2]
## member_casual month month_count
## <chr> <ord> <int>
## 1 member Aug 427008
## 2 member Jul 417433
## 3 casual Jul 406055
## 4 member Jun 400153
## 5 member Sep 392257
## 6 member Oct 373984
## 7 casual Jun 369051
## 8 casual Sep 363890
## 9 casual Aug 358924
## 10 member May 354443
## # … with 14 more rows
Checking for weekday bike usage
# the amount of times casual and members frequently use bikes during the weeks
weekday_group <- citibike_analyze %>%
select(., member_casual, weekday) %>%
group_by(., member_casual, weekday) %>%
summarise(., weekday_count = n())
Grouping number of rider by weekday
kable(weekday_group)
member_casual | weekday | weekday_count |
---|---|---|
casual | Sun | 437447 |
casual | Mon | 292184 |
casual | Tue | 278023 |
casual | Wed | 293245 |
casual | Thu | 311489 |
casual | Fri | 346027 |
casual | Sat | 510064 |
member | Sun | 404525 |
member | Mon | 474818 |
member | Tue | 536494 |
member | Wed | 547038 |
member | Thu | 525487 |
member | Fri | 472180 |
member | Sat | 454022 |
Plotting weekday_group
ggplot(weekday_group, aes(weekday, weekday_count, fill = member_casual)) +
geom_col(position = "dodge")
Checking for hour trend in bike usage
ride_per_hour <- citibike_analyze %>%
select(., member_casual, hour) %>%
group_by(., member_casual, hour) %>%
summarise(., hour_count = n()) %>%
arrange(., desc(hour_count))
Grouping number of rider by hour
kable(ride_per_hour)
member_casual | hour | hour_count |
---|---|---|
member | 17 | 355081 |
member | 16 | 293868 |
member | 18 | 292754 |
casual | 17 | 231580 |
member | 15 | 223312 |
member | 19 | 210495 |
casual | 18 | 209786 |
member | 8 | 205851 |
casual | 16 | 205589 |
member | 12 | 194427 |
member | 13 | 191356 |
member | 14 | 189171 |
casual | 15 | 186811 |
member | 7 | 175176 |
casual | 14 | 169913 |
member | 11 | 169171 |
casual | 13 | 162081 |
casual | 19 | 161806 |
casual | 12 | 154543 |
member | 9 | 148408 |
member | 20 | 147948 |
member | 10 | 141487 |
casual | 11 | 131899 |
casual | 20 | 119137 |
member | 21 | 114694 |
casual | 10 | 102084 |
casual | 21 | 102055 |
casual | 22 | 93429 |
member | 6 | 92186 |
member | 22 | 87831 |
casual | 9 | 77676 |
casual | 8 | 70789 |
casual | 23 | 69670 |
member | 23 | 58324 |
casual | 7 | 52971 |
casual | 0 | 49782 |
member | 0 | 36511 |
member | 5 | 33926 |
casual | 1 | 32794 |
casual | 6 | 29246 |
member | 1 | 22700 |
casual | 2 | 20766 |
casual | 5 | 13219 |
member | 2 | 12899 |
casual | 3 | 12120 |
member | 4 | 9096 |
casual | 4 | 8733 |
member | 3 | 7892 |
Plotting ride_per_hour
ggplot(ride_per_hour, aes(hour, hour_count, color = member_casual)) +
geom_line()
Checking average ride duration of casual and annual members
ride_per_day <- citibike_analyze %>%
select(.,date, member_casual, dur) %>%
group_by(.,date, member_casual) %>%
summarise(., avg_dur = mean(dur)) %>%
arrange(., date)
Showing the trend of average bike riders duration per day
ggplot(ride_per_day, aes(date, avg_dur, color = member_casual)) +
geom_line()