Cyclistics Bike-share Analysis

Introduction

About Cyclistic

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

My role

I’m a junior data analyst working in the marketing analytics team at Cyclistics, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, The data analytics team want to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual members into annual members.
In order to answer the key business questions, I will follow the step of data analysis process.

Data Analysis Processes

Ask - Define the problem and understand stakeholders expectation.
Prepare- Collecting and storing data for analysis.
Process - Cleaning and transforming data to a useful format.
Analyse - Using R to organize information to be able to draw useful conclusion and draw informed decision making.
Share - Sharing data insights through data visualizations.
Act - Solving problems through insights gotten from the analysis.

Ask

Business Task Cyclistic Executive Team want to maximize their current customers base by converting riders of single_ride passes and full_day passes to riders of annual membership.
Stakeholders
Primary stakeholder- Cyclistic Executive Team
Secondary stakeholder- Lily Moreno- Director of Marketing

The three questions asked were:

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

I was assigned by the Cyclistic Marketing Analytics Team to answer the first question “How do annual members and casual riders use Cyclistic bikes differently?”. In my analysis, I need to articulate how annual and casual members use Cyclistic Bike-share differently and identify the trend of how both set of customers use bike to be able to provide contextual recommendations on converting casual members to annual members.

Scope of Work(SOW)

Who collected the data? Cyclistic Bike-share collects data from its customers. I am using Cyclistic historical data.
What is the data impact? The data gives insight of how Cyclistic riders in Chicago ride bike.
Where is the data origin? The data was taken from Cyclistic riders who live in Chicago and was stored on their database.
When was the data created and collected? For my analysis I needed the most recent data of a year, I collected data of 2021 October till 2022 September.
Why was this data set created? To track and analyse Cyclistic riders data to gain insights on how to foster Cyclistic growth and customer satisfaction.
How was the data created? The data was gotten from the bikes since all Cyclistic bikes are geotracked

Prepare

Where the data is located The data was made available on this link
How the data is organized? The data is a structured data so it is ordered in rows and columns. It is a long data with 13 columns and 5883043 rows
verifying data credibility The data is credible has we used all Cyclistics historical trip data gotten from every customer riders.
Does the data ROCCC?
Reliable-The data is reliable, it is accurate, complete and not biased.
Original- The data is original has it is gotten directly from Cyclistic Bike-share users
Comprehensive- The data is complete to answer the question “How do annual members and casual riders use Cyclistic bikes differently?”.
Current- The data is current as it was from October 2021 till September 2022.
Cited- The data is from a vetted source “Cyclistic Analystic Team” The data is original has it is gotten directly from Cyclistic Bike-share users.
Addressing licensing, privacy, security, and accessibility? The data has been made available by Motivate International Inc. under this license.
Verifying data’s integrity? The data passed the data ethics test. It is collected with the consent of users. Personal information of users are preserved.
How the data help in answering the question The data contain all information needed to answer the question, “How do annual members and casual riders use Cyclistic bikes differently?”
Are there any problems with the data? No, it is ethical and non-bias

Process

Why I chose R for analyzing and cleaning the data R programming language and RStudio was use because the data is too large to commpute on Excel spreadsheet. Also it is easier analysing and visualising on R than SQL. The R packages used are: dplyr, lubridate, stringr, tibble, readr, tidyverse and knitr.
Loading packages

library(dplyr)
library(lubridate)
library(stringr)
library(tibble)
library(readr)
library(tidyverse)
library(knitr)

Loading data

#To print the working directory file path
getwd()

## [1] "C:/Users/dell/Documents"

# to comfirm the existence of our file in the right path
file.exists("Excel/1st project/202109-divvy-tripdata.csv")

## [1] TRUE

file.exists("Excel/1st project/202110-divvy-tripdata.csv")

## [1] TRUE

file.exists("Excel/1st project/202111-divvy-tripdata.csv")

## [1] TRUE

file.exists("Excel/1st project/202112-divvy-tripdata.csv")

## [1] TRUE

file.exists("Excel/1st project/202201-divvy-tripdata.csv")

## [1] TRUE

file.exists("Excel/1st project/202202-divvy-tripdata.csv")

## [1] TRUE

file.exists("Excel/1st project/202203-divvy-tripdata.csv")

## [1] TRUE

file.exists("Excel/1st project/202204-divvy-tripdata.csv")

## [1] TRUE

file.exists("Excel/1st project/202205-divvy-tripdata.csv")

## [1] TRUE

file.exists("Excel/1st project/202206-divvy-tripdata.csv")

## [1] TRUE

file.exists("Excel/1st project/202207-divvy-tripdata.csv")

## [1] TRUE

file.exists("Excel/1st project/202208-divvy-tripdata.csv")

## [1] TRUE

#with the help of the readr package, reading a csv file using the read_csv command
df202109 <- read_csv("Excel/1st project/202109-divvy-tripdata.csv")
df202110 <- read_csv("Excel/1st project/202110-divvy-tripdata.csv")
df202111 <- read_csv("Excel/1st project/202111-divvy-tripdata.csv")
df202112 <- read_csv("Excel/1st project/202112-divvy-tripdata.csv")
df202201 <- read_csv("Excel/1st project/202201-divvy-tripdata.csv")
df202202 <- read_csv("Excel/1st project/202202-divvy-tripdata.csv")
df202203 <- read_csv("Excel/1st project/202203-divvy-tripdata.csv")
df202204 <- read_csv("Excel/1st project/202204-divvy-tripdata.csv")
df202205 <- read_csv("Excel/1st project/202205-divvy-tripdata.csv")
df202206 <- read_csv("Excel/1st project/202206-divvy-tripdata.csv")
df202207 <- read_csv("Excel/1st project/202207-divvy-tripdata.csv")
df202208 <- read_csv("Excel/1st project/202208-divvy-tripdata.csv")

Ensuring the data’s integrity

checking for spelling errors in the string values li>
checking if any entry has duplicate values li>
checking for null values li>
computing the duration of ride by finding the difference of the start_at and end_at.

The data are in multiple files containing citibike data for each month, The data of 12 months were collected to be analysed from September 2021 till August 2022. All this month files were merged to be able to carry out a year analysis on citibike bike riders.

citibike_data <- rbind(df202109, df202110, df202111, df202112, df202201, df202202, 
                       df202203, df202204, df202205, df202206, df202207, df202208)

checking for spelling errors in the string values

# Checking if any entry of the rideabletype was spelt wrong
citibike_data %>%
  select(., rideable_type) %>%
  group_by(., rideable_type) %>%
  summarise()

## # A tibble: 3 × 1
##   rideable_type
##   <chr>        
## 1 classic_bike 
## 2 docked_bike  
## 3 electric_bike

# no spelling error as only 3 bike types exist

# Checking for if any entry of the member_casual was spelt wrong
citibike_data %>%
  select(., member_casual) %>%
  group_by(., member_casual) %>%
  summarise()

## # A tibble: 2 × 1
##   member_casual
##   <chr>        
## 1 casual       
## 2 member

# no spelling error as only 2 customer types exist

checking if any entry has duplicate values

citibike_data %>% 
  group_by(ride_id) %>% 
  summarise(count = n_distinct(ride_id)) %>% 
  filter(count > 1)

## # A tibble: 0 × 2
## # … with 2 variables: ride_id <chr>, count <int>

# I tested if we have any duplicate values, the test came out negative. We grouped by ride_id since it is said to be with unique values

computing the duration of ride by finding the difference of the start_at and end_at

citibike <- citibike_data %>% select(., ride_id, started_at, ended_at, rideable_type, start_station_id, end_station_id, member_casual, start_lng, start_lat) %>%
  mutate(., duration = ended_at - started_at)

Data Cleaning

cleaning the data set (formatting data types) li>
cleaning the negative duration value

cleaning the data set (formatting data types)

summary(citibike)

##    ride_id            started_at                    
##  Length:5883043     Min.   :2021-09-01 00:00:06.00  
##  Class :character   1st Qu.:2021-11-06 13:47:33.00  
##  Mode  :character   Median :2022-05-07 12:30:47.00  
##                     Mean   :2022-03-22 05:41:41.57  
##                     3rd Qu.:2022-07-06 16:00:29.50  
##                     Max.   :2022-08-31 23:59:39.00  
##     ended_at                      rideable_type      start_station_id  
##  Min.   :2021-09-01 00:00:41.00   Length:5883043     Length:5883043    
##  1st Qu.:2021-11-06 14:07:58.50   Class :character   Class :character  
##  Median :2022-05-07 12:52:55.00   Mode  :character   Mode  :character  
##  Mean   :2022-03-22 06:01:26.78                                        
##  3rd Qu.:2022-07-06 16:18:47.00                                        
##  Max.   :2022-09-06 21:49:04.00                                        
##  end_station_id     member_casual        start_lng        start_lat    
##  Length:5883043     Length:5883043     Min.   :-87.84   Min.   :41.64  
##  Class :character   Class :character   1st Qu.:-87.66   1st Qu.:41.88  
##  Mode  :character   Mode  :character   Median :-87.64   Median :41.90  
##                                        Mean   :-87.65   Mean   :41.90  
##                                        3rd Qu.:-87.63   3rd Qu.:41.93  
##                                        Max.   :-73.80   Max.   :45.64  
##    duration       
##  Length:5883043   
##  Class :difftime  
##  Mode  :numeric   
##                   
##                   
##

# The duration column data type 'datediff' can not be worked with so it has to be changed to a numeric type
citibike$duration <- as.numeric(citibike$duration)

summary(citibike)

##    ride_id            started_at                    
##  Length:5883043     Min.   :2021-09-01 00:00:06.00  
##  Class :character   1st Qu.:2021-11-06 13:47:33.00  
##  Mode  :character   Median :2022-05-07 12:30:47.00  
##                     Mean   :2022-03-22 05:41:41.57  
##                     3rd Qu.:2022-07-06 16:00:29.50  
##                     Max.   :2022-08-31 23:59:39.00  
##     ended_at                      rideable_type      start_station_id  
##  Min.   :2021-09-01 00:00:41.00   Length:5883043     Length:5883043    
##  1st Qu.:2021-11-06 14:07:58.50   Class :character   Class :character  
##  Median :2022-05-07 12:52:55.00   Mode  :character   Mode  :character  
##  Mean   :2022-03-22 06:01:26.78                                        
##  3rd Qu.:2022-07-06 16:18:47.00                                        
##  Max.   :2022-09-06 21:49:04.00                                        
##  end_station_id     member_casual        start_lng        start_lat    
##  Length:5883043     Length:5883043     Min.   :-87.84   Min.   :41.64  
##  Class :character   Class :character   1st Qu.:-87.66   1st Qu.:41.88  
##  Mode  :character   Mode  :character   Median :-87.64   Median :41.90  
##                                        Mean   :-87.65   Mean   :41.90  
##                                        3rd Qu.:-87.63   3rd Qu.:41.93  
##                                        Max.   :-73.80   Max.   :45.64  
##     duration      
##  Min.   :  -8245  
##  1st Qu.:    363  
##  Median :    643  
##  Mean   :   1185  
##  3rd Qu.:   1160  
##  Max.   :2442301

# some irregularities where found in the data as the min value is -8245 and max value is 2442301

cleaning the negative duration value For this process we try as much as possible not to loose any value in the dataset. Instead of deleting the negative durations we seperated it from the entire dataset and worked on it then merged it back with the whole dateset to get an wholesome dateset.

# viewing the duration of negative value
citibike %>%
  select(., started_at, ended_at, duration) %>%
  filter(., duration < 0)

## # A tibble: 135 × 3
##    started_at          ended_at            duration
##    <dttm>              <dttm>                 <dbl>
##  1 2021-09-29 17:04:38 2021-09-29 17:04:27      -11
##  2 2021-09-01 17:49:37 2021-09-01 17:49:31       -6
##  3 2021-09-29 16:53:34 2021-09-29 16:53:29       -5
##  4 2021-09-01 18:45:38 2021-09-01 18:45:24      -14
##  5 2021-09-29 18:42:50 2021-09-29 18:36:24     -386
##  6 2021-09-29 16:10:02 2021-09-29 16:09:59       -3
##  7 2021-09-29 15:40:52 2021-09-29 15:40:18      -34
##  8 2021-09-29 16:56:09 2021-09-29 16:53:46     -143
##  9 2021-09-29 14:02:52 2021-09-29 14:02:51       -1
## 10 2021-09-29 15:21:39 2021-09-29 15:21:09      -30
## # … with 125 more rows

# To clean the data, the negative value is separated from the positive value
citibike_pos <- citibike %>% 
  select(., ride_id, started_at, ended_at, duration,rideable_type,
         start_station_id, end_station_id, member_casual, start_lng, start_lat) %>%
  filter(., duration >= 0)

#
citibike_neg <- citibike %>%
  select(., ride_id, started_at = ended_at, ended_at = started_at, duration, rideable_type, start_station_id, end_station_id, member_casual, start_lng, start_lat) %>%
  filter(., duration < 0)

# Then it was joined together after cleaning
citibike_abs <- rbind(citibike_pos, citibike_neg) %>%
  select(., ride_id, started_at, ended_at, duration, rideable_type,
         start_station_id, end_station_id, member_casual, start_lng, start_lat) %>%
  mutate(., dur = ended_at - started_at) 
# The dur column data type 'datediff' can not be worked with so it has to be changed to a numeric type
citibike_abs$dur <- as.numeric(citibike_abs$dur)

How can you verify that your data is clean and ready to analyze?

# Checking if the problem as been resolved
citibike_abs %>% 
  filter(., dur < 0)

## # A tibble: 0 × 11
## # … with 11 variables: ride_id <chr>, started_at <dttm>, ended_at <dttm>,
## #   duration <dbl>, rideable_type <chr>, start_station_id <chr>,
## #   end_station_id <chr>, member_casual <chr>, start_lng <dbl>,
## #   start_lat <dbl>, dur <dbl>

# There is no negative duration

Exploratory Data Analysis

Note: This analysis was made necessary to check how casual and annual members use Citibike bike differently.
1. Checking for monthly trend in bike usage
2. Checking for weekday bike usage
3. Checking for hour trend in bike usage
4. Checking average ride duration of casual and annual members

Extracting month, weekday and hour from the started_at(datettime) column

# A column of  month was made and the month value was extracted from the started_at column
citibike_abs$month <- month(citibike_abs$started_at, label = TRUE)

# A column of weekday was made and the weekday value was extracted from the started_at column
citibike_abs$weekday <- wday(citibike_abs$started_at, label = TRUE)

# A column of  hour was made and the hour value was extracted from the started_at column
citibike_abs$hour <- hour(ymd_hms(citibike_abs$started_at))

citibike_analyze <- citibike_abs %>%
  select(., ride_id, started_at, rideable_type, dur, member_casual, month, weekday, hour, start_lng, start_lat)
citibike_analyze$dur <- as.numeric(citibike_analyze$dur)
# extracting the date out of the datetime format
citibike_analyze$date <- as.Date(citibike_analyze$started_at)

# adding the date column to the data
citibike_analyze <- citibike_analyze %>%
  select(., date, started_at ,rideable_type, dur, member_casual, month, weekday, hour)

Checking for monthly trend in bike usage

citibike_analyze %>%
  select(., member_casual, month) %>%
  group_by(., member_casual, month) %>%
  summarise(., month_count = n()) %>%
  arrange(., desc(month_count))

## # A tibble: 24 × 3
## # Groups:   member_casual [2]
##    member_casual month month_count
##    <chr>         <ord>       <int>
##  1 member        Aug        427008
##  2 member        Jul        417433
##  3 casual        Jul        406055
##  4 member        Jun        400153
##  5 member        Sep        392257
##  6 member        Oct        373984
##  7 casual        Jun        369051
##  8 casual        Sep        363890
##  9 casual        Aug        358924
## 10 member        May        354443
## # … with 14 more rows

Checking for weekday bike usage

# the amount of times casual and members frequently use bikes during the weeks
weekday_group <- citibike_analyze %>%
  select(., member_casual, weekday) %>%
  group_by(., member_casual, weekday) %>%
  summarise(., weekday_count = n())

Grouping number of rider by weekday

kable(weekday_group)

member_casual	weekday	weekday_count
casual	Sun	437447
casual	Mon	292184
casual	Tue	278023
casual	Wed	293245
casual	Thu	311489
casual	Fri	346027
casual	Sat	510064
member	Sun	404525
member	Mon	474818
member	Tue	536494
member	Wed	547038
member	Thu	525487
member	Fri	472180
member	Sat	454022

Plotting weekday_group

ggplot(weekday_group, aes(weekday, weekday_count, fill = member_casual)) +
  geom_col(position = "dodge")

Checking for hour trend in bike usage

ride_per_hour <- citibike_analyze %>%
  select(., member_casual, hour) %>%
  group_by(., member_casual, hour) %>%
  summarise(., hour_count = n()) %>%
  arrange(., desc(hour_count))

Grouping number of rider by hour

kable(ride_per_hour)

member_casual	hour	hour_count
member	17	355081
member	16	293868
member	18	292754
casual	17	231580
member	15	223312
member	19	210495
casual	18	209786
member	8	205851
casual	16	205589
member	12	194427
member	13	191356
member	14	189171
casual	15	186811
member	7	175176
casual	14	169913
member	11	169171
casual	13	162081
casual	19	161806
casual	12	154543
member	9	148408
member	20	147948
member	10	141487
casual	11	131899
casual	20	119137
member	21	114694
casual	10	102084
casual	21	102055
casual	22	93429
member	6	92186
member	22	87831
casual	9	77676
casual	8	70789
casual	23	69670
member	23	58324
casual	7	52971
casual	0	49782
member	0	36511
member	5	33926
casual	1	32794
casual	6	29246
member	1	22700
casual	2	20766
casual	5	13219
member	2	12899
casual	3	12120
member	4	9096
casual	4	8733
member	3	7892

Plotting ride_per_hour

ggplot(ride_per_hour, aes(hour, hour_count, color = member_casual)) +
  geom_line()

Checking average ride duration of casual and annual members

ride_per_day <- citibike_analyze %>%
  select(.,date, member_casual, dur) %>%
  group_by(.,date, member_casual) %>%
  summarise(., avg_dur = mean(dur)) %>%
  arrange(., date)

Showing the trend of average bike riders duration per day

ggplot(ride_per_day, aes(date, avg_dur, color = member_casual)) +
  geom_line()

write_csv(citibike_analyze, "citibike.csv")

The cleaned data was exported out for further visualization and analysis in Tableau see link

Act

Insight

Members tend to use bike more during the week days to commute to work and the use bikes the least during weekends
Casual members tends to use bike as opposed to the members more on weekends and the least on weekdays. This Trends shows that most people that have membership are people that often use the bike to commute to work hence, the large amount of people that use the bike during the weekdays. On the contrary, casual riders most likely use bikes for liesure and casual errands.
Backing our earlier claim that a huge chunk of member use cyclistic to commute to work. Annual members use bikes to commute to work that why there is increase in bike usage from 7am to 9 am see(ride_per_hour plot).

Recommendation

Since casual riders rides bike mostly during weekends I suggest a mouth watering weekend discount on weekends that will only be applicable to only annual members.
There should be dailly goal of minute of rides a person ought to make to keep fit and healthy, this will increase the frequency at which individuals use bike thereby causing them to subscribe for annual membership.

Conclusion To provide a concrete solution to converting casual members to annual members the other questions (a) Why would casual members by Cylistic annual members? and (b) How can Cyclistic use digital media to influence casual riders to become members? must be appropiately answered. With my insights and recommendations listing how casual riders use bike differently from annual members, combining these recommendations with the other analysis will give a concrete strategy of converting casual members to annual members.
I made a visualisation on tableau click here.

Cyclistics Bike-share Analysis

Josiah Paul

2023-02-14

Introduction

About Cyclistic

My role

Data Analysis Processes

Ask

The three questions asked were:

Scope of Work(SOW)

Prepare

Process

Exploratory Data Analysis

Act

Cyclistics Bike-share Analysis

Josiah Paul

2023-02-14

Introduction

About Cyclistic

My role

Data Analysis Processes

Ask

The three questions asked were:

Scope of Work(SOW)

Prepare

Process

Exploratory Data Analysis

Share

Act