A05 Assignment: EDA2
Exploratory Data Analysis of the Hotel Booking Demand
Introduction
This data set contains booking information for a city hotel and a resort hotel, including when the booking was made, length of stay, number of adults, children, and/or babies, and number of available parking spaces, among other things. All personally identifying information has been removed from the data. To gain insight from the data, we will conduct exploratory data analysis.
Questions: 1) How do guests who book the same hotel repeatedly behave? 2) How likely are they to cancel a booking? 3) What kind of meal plan do they choose? 4) How many days before check-in do they book?
To answer these questions, we analyze the Hotel Booking Demand dataset, which is publicly available on Kaggle.
Import Packages
First, import the necessary packages and the dataset.
Import the hotel booking dataset files
## Rows: 119,390
## Columns: 32
## $ hotel <chr> "Resort Hotel", "Resort Hotel", "Resort~
## $ is_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ~
## $ lead_time <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, ~
## $ arrival_date_year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201~
## $ arrival_date_month <chr> "July", "July", "July", "July", "July",~
## $ arrival_date_week_number <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,~
## $ arrival_date_day_of_month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ stays_in_weekend_nights <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ stays_in_week_nights <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, ~
## $ adults <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ children <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ babies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB~
## $ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR~
## $ market_segment <chr> "Direct", "Direct", "Direct", "Corporat~
## $ distribution_channel <chr> "Direct", "Direct", "Direct", "Corporat~
## $ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "C",~
## $ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "C",~
## $ booking_changes <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ deposit_type <chr> "No Deposit", "No Deposit", "No Deposit~
## $ agent <chr> "NULL", "NULL", "NULL", "304", "240", "~
## $ company <chr> "NULL", "NULL", "NULL", "NULL", "NULL",~
## $ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ customer_type <chr> "Transient", "Transient", "Transient", ~
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,~
## $ required_car_parking_spaces <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ total_of_special_requests <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, ~
## $ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out", ~
## $ reservation_status_date <date> 2015-07-01, 2015-07-01, 2015-07-02, 20~
Dataset link
Name | link |
---|---|
Hotel Booking Demand Dataset | Click here |
Missing values
By examining the missing values for each variable, we can see that there are only four missing values for variable “children,” with no missing values in the other variables.
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
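A per-column count like the one above can be produced in base R with `colSums(is.na(...))`. The sketch below uses a small made-up data frame (`bookings` is a hypothetical stand-in, not the real data), so only the technique carries over.

```r
# Toy data frame standing in for the booking data (values are made up).
bookings <- data.frame(
  adults   = c(2, 1, 2, 2, 3),
  children = c(0, NA, 1, NA, 0),
  babies   = c(0, 0, 0, 1, 0)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column.
missing_per_column <- colSums(is.na(bookings))
print(missing_per_column)
```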
Data cleaning
Take care of the missing values
The variable “children” has four missing values, which are imputed with the median of that variable. Because only 4 of 119,390 rows are affected, this simple imputation has a negligible effect on the results.
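Median imputation can be sketched in a few lines of base R. The vector below is made-up toy data, and the variable names are illustrative assumptions rather than the code used for this report.

```r
# Toy vector with two missing entries (made-up values).
children <- c(0, 0, NA, 2, 0, 1, NA, 0)

# Compute the median over the observed values only.
children_median <- median(children, na.rm = TRUE)

# Replace each NA with that median; observed values are left unchanged.
children_imputed <- ifelse(is.na(children), children_median, children)
print(children_imputed)
```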
Make the data more meaningful
The meal value “Undefined” is recoded as “SC” because both “Undefined” and “SC” indicate that the customer did not book a meal package at the hotel.
The variable “is_repeated_guest” is binary, so it should be a factor. The same conversion is applied to the variable “is_canceled.”
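The recoding and the factor conversion described above might look like the following base R sketch. The toy vectors are made up, and this is one possible way to do it, not necessarily how the original chunk was written.

```r
# Toy columns standing in for the real data (made-up values).
meal <- c("BB", "Undefined", "SC", "HB", "Undefined", "FB")
is_repeated_guest <- c(0, 0, 1, 0, 1, 0)

# Merge the "Undefined" category into "SC", then store the result as a factor.
meal[meal == "Undefined"] <- "SC"
meal <- factor(meal)

# A 0/1 indicator is categorical, so convert it to a factor with explicit levels.
is_repeated_guest <- factor(is_repeated_guest, levels = c(0, 1))

print(table(meal))
```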
- Statistical summary
We can get some useful information from the statistical summary. For example, approximately 37% of bookings were canceled (44,224 of 119,390). Arrival dates range from 2015 to 2017. Half of the bookings include at most two week nights, and 75% include at most three. The longest stay recorded is 50 week nights. The average number of previous cancellations is well below one (0.09). 75% of all bookings have at most one special request. Repeat guests account for about 3% of all bookings.
## hotel is_canceled lead_time arrival_date_year
## Length:119390 0:75166 Min. : 0 Min. :2015
## Class :character 1:44224 1st Qu.: 18 1st Qu.:2016
## Mode :character Median : 69 Median :2016
## Mean :104 Mean :2016
## 3rd Qu.:160 3rd Qu.:2017
## Max. :737 Max. :2017
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 BB:92310 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 FB: 798 Class :character
## Median : 0.0000 Median : 0.000000 HB:14463 Mode :character
## Mean : 0.1039 Mean : 0.007949 SC:11819
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 0:115580
## Class :character Class :character 1: 3810
## Mode :character Mode :character
##
##
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
## reservation_status_date
## Min. :2014-10-17
## 1st Qu.:2016-02-01
## Median :2016-08-07
## Mean :2016-07-30
## 3rd Qu.:2017-02-08
## Max. :2017-09-14
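When reading a summary like this, it helps to remember that a quantile is an upper bound for a share of the data: a median of 2 means at least half of the values are 2 or less, not that half are exactly 2. The sketch below checks this on a made-up vector (`week_nights` here is toy data, not the real column).

```r
# Made-up numbers of week nights for ten bookings.
week_nights <- c(0, 1, 1, 2, 2, 2, 3, 3, 5, 10)

# Median and third quartile, as reported by summary().
q <- quantile(week_nights, probs = c(0.5, 0.75), names = FALSE)

# Share of bookings at or below each quantile.
share_at_most_median <- mean(week_nights <= q[1])
share_at_most_q3 <- mean(week_nights <= q[2])

print(q)
print(c(share_at_most_median, share_at_most_q3))
```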
- Glimpse the data
The dataset looks clean and tidy now.
## # A tibble: 20 x 32
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <fct> <dbl> <dbl> <chr>
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## 7 Resort Hotel 0 0 2015 July
## 8 Resort Hotel 0 9 2015 July
## 9 Resort Hotel 1 85 2015 July
## 10 Resort Hotel 1 75 2015 July
## 11 Resort Hotel 1 23 2015 July
## 12 Resort Hotel 0 35 2015 July
## 13 Resort Hotel 0 68 2015 July
## 14 Resort Hotel 0 18 2015 July
## 15 Resort Hotel 0 37 2015 July
## 16 Resort Hotel 0 68 2015 July
## 17 Resort Hotel 0 37 2015 July
## 18 Resort Hotel 0 12 2015 July
## 19 Resort Hotel 0 0 2015 July
## 20 Resort Hotel 0 7 2015 July
## # ... with 27 more variables: arrival_date_week_number <dbl>,
## # arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## # stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## # meal <fct>, country <chr>, market_segment <chr>,
## # distribution_channel <chr>, is_repeated_guest <fct>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## # deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
## # customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
## # total_of_special_requests <dbl>, reservation_status <chr>,
## # reservation_status_date <date>
EXPLORATORY DATA ANALYSIS
- Distribution of guests:
According to the table and pie chart below, 3.19 percent of the guests are repeat visitors, while the rest are first-time visitors. These figures are plausible, since most travelers do not return to the same hotel repeatedly.
## # A tibble: 2 x 3
## is_repeated_guest percent total
## <fct> <dbl> <int>
## 1 0 96.8 115580
## 2 1 3.19 3810
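The percentages in this table follow directly from the two group counts. As a quick check, the sketch below recomputes them in base R from the totals shown above; the object names are illustrative.

```r
# Group sizes taken from the table above.
guest_counts <- c("0" = 115580, "1" = 3810)

# Percentage of all bookings in each group, rounded to two decimals.
guest_summary <- data.frame(
  is_repeated_guest = names(guest_counts),
  percent = round(100 * as.numeric(guest_counts) / sum(guest_counts), 2),
  total = as.numeric(guest_counts)
)
print(guest_summary)
```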
- The difference in lead time between two groups of guests:
To investigate several behaviors of repeat guests, I wrote two helper functions that produce the visualizations.
guestbehaboxplot(hotel_data$lead_time)
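The definitions of the helper functions are not shown in this report. The following is a minimal sketch of what a boxplot helper of this kind might look like; the name `guest_boxplot_sketch`, its arguments, and the simulated data are all hypothetical and are not the author's code.

```r
# Hypothetical helper: draws one box per guest group for a numeric variable
# and returns the boxplot statistics invisibly computed by boxplot().
guest_boxplot_sketch <- function(values, group, label = "value") {
  boxplot(values ~ group,
          xlab = "is_repeated_guest", ylab = label,
          main = paste(label, "by guest group"))
}

# Simulated lead times for two groups of 50 bookings each (made-up data).
set.seed(1)
toy <- data.frame(
  is_repeated_guest = factor(rep(c(0, 1), each = 50)),
  lead_time = c(rexp(50, rate = 1 / 100), rexp(50, rate = 1 / 30))
)

stats <- guest_boxplot_sketch(toy$lead_time, toy$is_repeated_guest, "lead_time")
```

Passing the grouping variable explicitly keeps the sketch independent of any particular data frame name.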
The lead time for repeat guests is substantially shorter than that for first-time guests. The average lead time for repeat guests is 31 days, while that for first-time guests is 106 days.
## # A tibble: 2 x 2
## is_repeated_guest mean
## <fct> <dbl>
## 1 0 106.
## 2 1 30.8
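A per-group mean such as the one above can be computed in base R with `tapply()`. The data frame below is a made-up toy example, so the resulting numbers differ from the report.

```r
# Toy data: three first-time bookings (0) and three repeat bookings (1).
toy <- data.frame(
  is_repeated_guest = factor(c(0, 0, 0, 1, 1, 1)),
  lead_time = c(120, 90, 150, 10, 40, 25)
)

# Apply mean() to lead_time separately within each guest group.
mean_lead_time <- tapply(toy$lead_time, toy$is_repeated_guest, mean)
print(mean_lead_time)
```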
- How often do repeat customers cancel their reservations?
First-time guests cancel 43,672 bookings, accounting for 37.8 percent of all bookings by first-time guests. In contrast, repeat guests cancel 552 bookings, accounting for only 14.5 percent of the bookings in the repeat group. In the output below, row 1 is the first-time group and row 2 is the repeat group.
## is_repeated_guest n
## 1 NA 0.3778508
## 2 NA 0.1448819
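The `NA` values in the `is_repeated_guest` column of this output are consistent with dividing one count table by another, since arithmetic on a factor column yields `NA` in R; this is only a guess, because the original code is not shown. A sketch that keeps the group labels intact, using the counts quoted in the text, could look like this (object names are illustrative):

```r
# Canceled bookings and total bookings per group, as quoted in the text.
canceled <- c("0" = 43672, "1" = 552)
total <- c("0" = 115580, "1" = 3810)

# Divide only the numeric counts and carry the group label along separately.
cancel_rate <- data.frame(
  is_repeated_guest = factor(names(total)),
  rate = as.numeric(canceled / total)
)
print(cancel_rate)
```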
- Do returning guests have previous cancellations?
Mostly not. 75.7 percent of returning guests have no cancellation history, compared with 95.2 percent of first-time guests. The gap is plausible because first-time guests have had little opportunity to build up any booking history with the hotel. The average number of previous cancellations is 0.47 for repeat guests and 0.07 for first-time guests.
## is_repeated_guest n
## 1 NA 0.9519207
## 2 NA 0.7566929
## # A tibble: 2 x 2
## is_repeated_guest mean
## <fct> <dbl>
## 1 0 0.0745
## 2 1 0.470
- How many previous bookings were not cancelled?
74.5 percent of repeat guests have at least one previous booking that was not cancelled, compared with only 0.7 percent of first-time guests. The average number of previous bookings that were not cancelled is 3.59 for repeat guests and 0.02 for first-time guests.
## is_repeated_guest n
## 1 NA 0.006765876
## 2 NA 0.744881890
## # A tibble: 2 x 2
## is_repeated_guest mean
## <fct> <dbl>
## 1 0 0.0234
## 2 1 3.59
- What kind of hotel would returning guests choose?
Repeat guests book City Hotel more often than Resort Hotel. However, 46.7 percent of repeat guests book Resort Hotel, while only 33.1 percent of first-time guests do.
## is_repeated_guest n
## 1 NA 0.3312165
## 2 NA 0.4666667
- Do returning guests change their reservations?
81.2 percent of returning guests do not change their booking. This figure is slightly lower than the 85.0 percent of first-time guests.
## is_repeated_guest n
## 1 NA 0.8498183
## 2 NA 0.8115486
- Meal preference of returning guests.
Comparing the meal choices of the two groups, 91.2 percent of repeat guests choose Bed & Breakfast (BB), compared with 76.9 percent of first-time guests, and repeat guests choose each of the other three meal plans less often.
##
## 0 1
## BB 88837 3473
## FB 789 9
## HB 14277 186
## SC 11677 142
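Raw counts are hard to compare across groups of such different sizes, so it can help to convert each column to percentages with `prop.table()`. The sketch below re-enters the counts from the table above by hand; the object names are illustrative.

```r
# Meal counts copied from the table above (rows: meal plan, columns: guest group).
meal_counts <- matrix(
  c(88837, 789, 14277, 11677,
    3473, 9, 186, 142),
  nrow = 4,
  dimnames = list(meal = c("BB", "FB", "HB", "SC"),
                  is_repeated_guest = c("0", "1"))
)

# margin = 2 normalizes within each column, i.e. within each guest group.
meal_share <- round(100 * prop.table(meal_counts, margin = 2), 1)
print(meal_share)
```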
- Do returning visitors leave a deposit?
Normally not. 98.2% of repeat guests do not make a deposit, compared with 87.3% of first-time guests. A possible reason is that the hotel regards them as reliable guests and therefore does not require a deposit.
## is_repeated_guest n
## 1 NA 0.8729798
## 2 NA 0.9821522
- Do frequent visitors make special requests?
Over half of them do not. 58.2 percent of returning guests do not make any special requests, which is very close to the 58.9 percent of first-time guests.
## is_repeated_guest n
## 1 NA 0.5892109
## 2 NA 0.5818898
- Customer type of returning visitors
80.7 percent of returning guests make bookings that are not part of a group or contract and are unrelated to other transient bookings. Bookings associated with a group are made by 4.2 percent of repeat guests. Both of these shares are higher than the corresponding shares for first-time guests.
## is_repeated_guest n
## 1 NA 0.7487455
## 2 NA 0.8065617
- Distribution channel
- Histogram illustrating days in waiting list and cancellations
SUMMARY
Only 3.19 percent of the guests are returning.
Repeat guests book about one month before arrival on average (31 days), a much shorter lead time than that of first-time guests (106 days). One possible explanation is that repeat guests already know which hotel they will book, so they do not need to plan as far ahead.
Repeat guests are much less likely to cancel a booking than first-time guests (14.5 percent versus 37.8 percent), which suggests that returning guests are more loyal.
75.7 percent of returning guests have no cancellation history. 74.5 percent of repeat guests have at least one previous booking that was not cancelled, and the average number of such bookings is 3.59.
Returning guests book City Hotel slightly more often than Resort Hotel. However, the share choosing Resort Hotel is higher among repeat guests than among first-time guests.
81.2 percent of returning guests do not change their booking.
When booking hotels, 91.2 percent of repeat guests select Bed & Breakfast, a higher share than among first-time guests.
98.2 percent of returning guests do not make a deposit, a higher share than among first-time guests.
More than half of returning guests do not make special requests, about the same share as among first-time guests.
80.7 percent of returning guests make bookings that are not part of a group or contract and are unrelated to other transient bookings.