A05 Assignment: EDA2

Exploratory Data Analysis of the Hotel Booking Demand

Introduction

This data set contains booking information for a city hotel and a resort hotel, including when the booking was made, length of stay, number of adults, children, and/or babies, and number of required parking spaces, among other things. All personally identifying information has been removed from the data. To gain insight from the data, we will conduct exploratory data analysis.

Questions: 1) How do people who book the same hotel repeatedly behave? 2) How likely are they to cancel a booking? 3) What kind of meal plan do they choose? 4) How many days before check-in do they book?

To answer these questions, we analyze the Hotel Booking Demand dataset, which is publicly available on Kaggle.

Import Packages

First, import the necessary packages and the dataset.

Import the hotel booking dataset files

## Rows: 119,390
## Columns: 32
## $ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Resort~
## $ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ~
## $ lead_time                      <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, ~
## $ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201~
## $ arrival_date_month             <chr> "July", "July", "July", "July", "July",~
## $ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,~
## $ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, ~
## $ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB~
## $ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR~
## $ market_segment                 <chr> "Direct", "Direct", "Direct", "Corporat~
## $ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corporat~
## $ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "C",~
## $ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "C",~
## $ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ deposit_type                   <chr> "No Deposit", "No Deposit", "No Deposit~
## $ agent                          <chr> "NULL", "NULL", "NULL", "304", "240", "~
## $ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NULL",~
## $ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ customer_type                  <chr> "Transient", "Transient", "Transient", ~
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,~
## $ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, ~
## $ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out", ~
## $ reservation_status_date        <date> 2015-07-01, 2015-07-01, 2015-07-02, 20~

Missing values

By examining the missing values for each variable, we can see that there are only four missing values for variable “children,” with no missing values in the other variables.

##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0
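A per-variable count like the one above can be obtained in a single call. The following is a minimal sketch in base R, not necessarily the code used for this report; the data frame `bookings` and its values are made up for illustration.

```r
# Toy data frame standing in for the real bookings table (values are made up).
bookings <- data.frame(
  adults   = c(2, 1, 2, 2, 3),
  children = c(0, NA, 1, NA, 0),
  meal     = c("BB", "BB", "HB", "SC", "Undefined"),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column.
missing_per_column <- colSums(is.na(bookings))
print(missing_per_column)
```

Note that this only counts true `NA` values. Text placeholders stored as strings are not counted as missing.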

Data cleaning

  1. Take care of the missing values

The variable “children” has four missing values, which are imputed with the median of the variable. Because only four of the 119,390 rows are affected, this simple approach is appropriate.

  2. Make the data more meaningful

The meal value “Undefined” is recoded as “SC” because both “Undefined” and “SC” indicate that the customer chose not to eat at the hotel.

The variable “is repeated guest” is binary, so it should be a factor. The same modification is required for the variable “is canceled.”
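The cleaning steps described above can be sketched in base R as follows. This is an illustrative version under assumptions, not the report's own code: the data frame `bookings` is a made-up toy sample that reuses the column names from the dataset.

```r
# Toy sample with the same column names as the dataset (values are made up).
bookings <- data.frame(
  children          = c(0, NA, 1, 0, 0),
  meal              = c("BB", "Undefined", "HB", "SC", "BB"),
  is_repeated_guest = c(0, 0, 1, 0, 1),
  is_canceled       = c(0, 1, 0, 0, 1),
  stringsAsFactors  = FALSE
)

# 1. Replace missing `children` values with the median of the observed values.
children_median <- median(bookings$children, na.rm = TRUE)
bookings$children[is.na(bookings$children)] <- children_median

# 2. Fold the "Undefined" meal label into "SC", then store meal as a factor.
bookings$meal[bookings$meal == "Undefined"] <- "SC"
bookings$meal <- factor(bookings$meal)

# 3. Treat the two binary 0/1 indicators as categorical variables.
bookings$is_repeated_guest <- factor(bookings$is_repeated_guest)
bookings$is_canceled       <- factor(bookings$is_canceled)

str(bookings)
```

The median is computed with `na.rm = TRUE`, since otherwise the missing values themselves would make the result `NA`.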

  3. Statistical summary

We can get some useful information from the statistical summary. For example, approximately 37% of bookings have been canceled. Arrival dates range from 2015 to 2017. Half of the bookings include two or fewer week nights, and 75% include three or fewer. The longest recorded stay covers 50 week nights. The average number of prior cancellations is less than one (0.09). At most one special request is made in 75% of all bookings. Repeat customers account for about 3% of all bookings.

##     hotel           is_canceled   lead_time   arrival_date_year
##  Length:119390      0:75166     Min.   :  0   Min.   :2015     
##  Class :character   1:44224     1st Qu.: 18   1st Qu.:2016     
##  Mode  :character               Median : 69   Median :2016     
##                                 Mean   :104   Mean   :2016     
##                                 3rd Qu.:160   3rd Qu.:2017     
##                                 Max.   :737   Max.   :2017     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##     children           babies          meal         country         
##  Min.   : 0.0000   Min.   : 0.000000   BB:92310   Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   FB:  798   Class :character  
##  Median : 0.0000   Median : 0.000000   HB:14463   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949   SC:11819                     
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                
##  Max.   :10.0000   Max.   :10.000000                                
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        0:115580         
##  Class :character   Class :character     1:  3810         
##  Mode  :character   Mode  :character                      
##                                                           
##                                                           
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##  reservation_status_date
##  Min.   :2014-10-17     
##  1st Qu.:2016-02-01     
##  Median :2016-08-07     
##  Mean   :2016-07-30     
##  3rd Qu.:2017-02-08     
##  Max.   :2017-09-14

  4. Glimpse the data

The dataset looks clean and tidy now.

## # A tibble: 20 x 32
##    hotel        is_canceled lead_time arrival_date_year arrival_date_month
##    <chr>        <fct>           <dbl>             <dbl> <chr>             
##  1 Resort Hotel 0                 342              2015 July              
##  2 Resort Hotel 0                 737              2015 July              
##  3 Resort Hotel 0                   7              2015 July              
##  4 Resort Hotel 0                  13              2015 July              
##  5 Resort Hotel 0                  14              2015 July              
##  6 Resort Hotel 0                  14              2015 July              
##  7 Resort Hotel 0                   0              2015 July              
##  8 Resort Hotel 0                   9              2015 July              
##  9 Resort Hotel 1                  85              2015 July              
## 10 Resort Hotel 1                  75              2015 July              
## 11 Resort Hotel 1                  23              2015 July              
## 12 Resort Hotel 0                  35              2015 July              
## 13 Resort Hotel 0                  68              2015 July              
## 14 Resort Hotel 0                  18              2015 July              
## 15 Resort Hotel 0                  37              2015 July              
## 16 Resort Hotel 0                  68              2015 July              
## 17 Resort Hotel 0                  37              2015 July              
## 18 Resort Hotel 0                  12              2015 July              
## 19 Resort Hotel 0                   0              2015 July              
## 20 Resort Hotel 0                   7              2015 July              
## # ... with 27 more variables: arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## #   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## #   meal <fct>, country <chr>, market_segment <chr>,
## #   distribution_channel <chr>, is_repeated_guest <fct>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## #   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
## #   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
## #   total_of_special_requests <dbl>, reservation_status <chr>,
## #   reservation_status_date <date>

EXPLORATORY DATA ANALYSIS

  1. Distribution of guests:

According to the table and pie chart below, 3.19 percent of the bookings are made by repeat guests, while the rest are made by first-time guests. This is plausible, because most travelers do not return to the same hotel repeatedly.

## # A tibble: 2 x 3
##   is_repeated_guest percent  total
##   <fct>               <dbl>  <int>
## 1 0                   96.8  115580
## 2 1                    3.19   3810
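As a check on the table above, the percentages can be recomputed from the two group totals. This is a small base R sketch rather than the report's original code; the object names are illustrative, and the counts are the ones shown in the table.

```r
# Group totals as reported in the table above (0 = first-time, 1 = repeat).
# On the full data these counts would come from table() on the indicator column.
total <- c("0" = 115580L, "1" = 3810L)

# Share of each group in percent, rounded to two decimals.
percent <- round(100 * total / sum(total), 2)

guest_distribution <- data.frame(
  is_repeated_guest = names(total),
  percent           = unname(percent),
  total             = unname(total)
)
print(guest_distribution)
```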

  2. The difference in lead time between the two groups of guests:

To investigate several behaviors of repeat visitors, I created two helper functions that produce the visualizations.

guestbehaboxplot(hotel_data$lead_time)

The lead time for repeat guests is considerably shorter than that for first-time guests. The average lead time for repeat guests is 31 days, while that for first-time guests is 106 days.

## # A tibble: 2 x 2
##   is_repeated_guest  mean
##   <fct>             <dbl>
## 1 0                 106. 
## 2 1                  30.8
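A group-wise mean like the one in the table above can be computed with `tapply()`. The sketch below uses a made-up toy sample, so its numbers do not match the real data, and it is only one of several equivalent ways to get this summary.

```r
# Toy sample (values are made up); column names follow the dataset.
bookings <- data.frame(
  is_repeated_guest = factor(c(0, 0, 0, 1, 1, 0)),
  lead_time         = c(100, 80, 150, 10, 30, 70)
)

# Apply mean() to lead_time separately within each level of the grouping factor.
mean_lead_time <- tapply(bookings$lead_time, bookings$is_repeated_guest, mean)
print(mean_lead_time)
```

The result is a named vector with one entry per factor level, so each group's mean can be looked up by its label.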

  3. How often do repeat customers cancel their reservations?

First-time guests cancel 43,672 bookings, or 37.8 percent of all bookings made by first-time guests. In contrast, repeat guests cancel 552 bookings, or only 14.5 percent of the bookings made by the repeat group.

##   is_repeated_guest         n
## 1                NA 0.3778508
## 2                NA 0.1448819
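One way to compute a within-group cancellation share while keeping the group labels attached to each row is `aggregate()` with a formula. This is a hedged sketch on a made-up toy sample, not the code behind the output above; the column name `cancel_rate` is introduced here for illustration.

```r
# Toy sample (values are made up). is_canceled is kept numeric here; if it has
# already been converted to a factor, compare it with "1" first to get 0/1.
bookings <- data.frame(
  is_repeated_guest = factor(c(0, 0, 0, 0, 1, 1, 1, 1)),
  is_canceled       = c(1, 1, 0, 0, 0, 0, 0, 1)
)

# The mean of a 0/1 indicator within a group is that group's cancellation share.
cancel_rate <- aggregate(is_canceled ~ is_repeated_guest, data = bookings, FUN = mean)
names(cancel_rate)[2] <- "cancel_rate"
print(cancel_rate)
```

The same pattern applies to any yes/no condition used in the later sections: build a 0/1 indicator for the condition and average it within each guest group.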

  4. Do previous guests cancel their reservations?

Yes. 75.7 percent of returning guests have no cancellation history, compared with 95.2 percent of first-time guests. This makes sense because first-time guests have little or no booking history with the hotel. The average number of previous cancellations is 0.47 for repeat guests and 0.07 for first-time guests.

##   is_repeated_guest         n
## 1                NA 0.9519207
## 2                NA 0.7566929
## # A tibble: 2 x 2
##   is_repeated_guest   mean
##   <fct>              <dbl>
## 1 0                 0.0745
## 2 1                 0.470

  5. How many bookings have not previously been cancelled?

74.5 percent of repeat guests have at least one previous booking that was not cancelled, compared with only 0.7 percent of first-time guests. The average number of previous bookings that were not cancelled is 3.59 for repeat guests and 0.02 for first-time guests.

##   is_repeated_guest           n
## 1                NA 0.006765876
## 2                NA 0.744881890
## # A tibble: 2 x 2
##   is_repeated_guest   mean
##   <fct>              <dbl>
## 1 0                 0.0234
## 2 1                 3.59

  6. What kind of hotel would returning guests choose?

Repeat guests book City Hotel slightly more often than Resort Hotel. However, Resort Hotel is relatively more popular among repeat guests: 46.7 percent of them book it, compared with only 33.1 percent of first-time guests.

##   is_repeated_guest         n
## 1                NA 0.3312165
## 2                NA 0.4666667

  7. Do returning guests change their reservations?

81.2 percent of returning guests do not change their booking. This figure is slightly lower than the 85.0 percent for first-time guests.

##   is_repeated_guest         n
## 1                NA 0.8498183
## 2                NA 0.8115486

  8. Meal preference of returning guests.

Compared with first-time guests, repeat guests choose Bed & Breakfast (BB) more often and each of the other three meal plans less often.

##     
##          0     1
##   BB 88837  3473
##   FB   789     9
##   HB 14277   186
##   SC 11677   142
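The comparison is easier to read as within-group percentages than as raw counts. The sketch below re-enters the counts from the table above and converts each column to percentages with `prop.table()`; the object names are illustrative.

```r
# Counts copied from the table above: rows are meal plans, columns are the
# guest groups (0 = first-time, 1 = repeat). matrix() fills column by column.
meal_counts <- matrix(
  c(88837, 789, 14277, 11677,
     3473,   9,   186,   142),
  nrow = 4,
  dimnames = list(meal = c("BB", "FB", "HB", "SC"),
                  is_repeated_guest = c("0", "1"))
)

# margin = 2 divides every cell by its column total, so each column sums to 1.
meal_share <- round(100 * prop.table(meal_counts, margin = 2), 1)
print(meal_share)
```

With these counts, BB accounts for about 91.2 percent of bookings by repeat guests and about 76.9 percent of bookings by first-time guests.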

  9. Do returning visitors leave a deposit?

Normally no. 98.2% of repeat guests do not make a deposit. A possible reason is that they are considered reliable guests, so no deposit is required. The corresponding figure for first-time guests is 87.3%.

##   is_repeated_guest         n
## 1                NA 0.8729798
## 2                NA 0.9821522

  10. Do frequent visitors make special requests?

Over half of them do not. 58.2 percent of returning guests do not make any special requests. This percentage is very close to the 58.9 percent for first-time visitors.

##   is_repeated_guest         n
## 1                NA 0.5892109
## 2                NA 0.5818898

  11. Customer type of returning visitors

80.7 percent of returning guests make bookings that are not part of a group or contract and are unrelated to other transient bookings. Bookings associated with a group are made by 4.2 percent of repeat guests. Both of these shares are higher than the corresponding shares for first-time visitors.

##   is_repeated_guest         n
## 1                NA 0.7487455
## 2                NA 0.8065617

  12. Distribution channel

  13. Histogram illustrating days in waiting list and cancellations

SUMMARY

  • Only 3.19 percent of the guests are returning.

  • Repeat guests tend to book about one month before their visit, a much shorter lead time than that of first-time visitors. One possible explanation is that repeat guests do not need to book early because they already know which hotel they will choose when they visit that location.

  • The likelihood of a repeat guest cancelling a booking is much lower than that of a first-time guest. This suggests that returning visitors are more loyal.

  • 75.7 percent of returning guests have no cancellation history. 74.5 percent of repeat guests have at least one previous booking that was not cancelled, and their average number of such bookings is 3.59.

  • Guests who return prefer City Hotel slightly more than Resort Hotel. However, the share choosing Resort Hotel is higher among repeat guests than among first-time guests.

  • 81.2 percent of returning guests do not change their booking.

  • When booking hotels, 91.2 percent of repeat guests select Bed & Breakfast, a higher share than among first-time visitors.

  • 98.2 percent of returning guests do not make a deposit, a higher share than among first-time visitors.

  • More than half of returning guests do not make special requests, a share almost identical to that of first-time visitors.

  • 80.7 percent of returning guests make bookings that are not part of a group or contract and are unrelated to other transient bookings.