Google Capstone Project: Bellabeat 🏃‍♀️

Introduction

As part of the Google Data Analytics Professional Certificate, this was the second optional case study provided. I have chosen to also do this project in order to enhance and reinforce my new skills. Like the previous project, this project will outline the following processes as taught in the programme; Ask, Prepare, Process, Analyse, Share and Act. To clean and manipulate the data, R programming will be used due to its vast efficiency. Unlike my previous Capstone Project, R programming will also be used for the visualisations.

Scenario

Bellabeat is a high tech manufacturer of health focused products for women. Although it is a small and successful company, Bellabeat has the potential to become a larger player in the global smart device market. Urška Sršen, the cofounder and Chief Creative Officer, believes new growth opportunities can be unlocked for the company by analysing the smart device fitness data. Therefore, the data analytics team wants to analyse one of their products in order to discover how consumers are using their smart devices. These insights will then guide and improve the current marketing strategy.

About the company

Established in 2013, cofounders Urška Sršen and Sando Mur have made Bellabeat a growing success and rapidly positioned the company as a tech-driven wellness company for women. Sršen expertise has allowed the development of an appealing designed technology that informs and inspires women around the world by tracking data on their activity, sleep, stress and reproductive health.

The marketing team have been asked to focus on a Bellabeat product and analyse its smart device usage to gain insights into how people are using it. This information will then allow a high level recommendation for how these trends can enhance Bellabeat’s marketing strategy and reveal more growth opportunities.

Business Task

Analyse one of the company’s products and explore the smart device data to gain insight into how customers use non-Bellabeat smart devices.

Product

Bellabeat app: The app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data helps users better understand their current habits and make healthier decisions. The app also connects to their line of smart wellness products.

Stakeholders

Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the company’s executive team
Bellabeat’s marketing analytics team: A team of data analysts responsible for collecting, analysing and reporting data that helps guide Bellabeat’s marketing strategy

Ask

The following three questions will guide the analysis:

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

Prepare

The data used in this project was the FitBit Fitness Tracker Data (CC0: Public Domain, data set made available through Mobius). The data set was created by respondents to a survey distributed on Amazon Mechanical Turk and corresponded to the period between 12.03.2016 and 12.05.2016. It explores smart device users’ daily habits and contains personal fitness tracker from thirty FitBit users. These users consented to the submission of their personal tracker data that consists of their minute level output for physical activity, heart rate, and sleep. There is a variation between output as it represents the different types of FitBit trackers used and personalised tracking preferences.

Data organisation

Eighteen CSV files are zipped into the data set and incorporates specific tracking information such as, daily calories, daily steps, activity intensity etc. It is organised in a long format whereby an ID column identifies each user, whilst the rest focuses on the attributes about the user.

Data integrity

Considering the sample size of the data set of only 33 FitBit users and 2 months worth of information, it is important to note that the data is not a complete representation of all FitBit users. In addition the data set is from seven years ago, thus users preferences and habits may have changed. Overall, this is potential credibility bias due to the aforementioned reasons as well as the data being collected by another party.

Despite this, the analysis will still continue and discover insightful information regarding the surveyed users and how they made use of their tracking devices over the period of time.

Data security

As the data set is from an open source public platform on Kaggle it is said to have little integrity. To address the privacy and security the data is anonymised with no personal identifiable information in the data. Instead it is classified with the use of unique IDs which linked to the different data set.

Load required R packages

library(tidyverse)
library(lubridate)
library(janitor)
library(skimr)
library(readr)
library(dplyr)
library(tidyr)
library(data.table)
library(ggpubr)
library(ggrepel)

Import and load the data sets

daily_activity <- read.csv("~/Documents/Bellabeat_Capstone/ dailyActivity_merged.csv")
daily_sleep <- read.csv("~/Documents/Bellabeat_Capstone/ sleepDay_merged.csv")
hourly_steps <- read.csv("~/Documents/Bellabeat_Capstone/ hourlySteps_merged.csv")

Console output from loading data sets

> daily_activity <- read_csv("~/Documents/Bellabeat_Capstone/dailyActivity_merged.csv")
Rows: 940 Columns: 15                                                                                                                                                         
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): ActivityDate
dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDistance, VeryActiveDistance, ModeratelyActiveDistance, Light...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

> daily_sleep <- read_csv("~/Documents/Bellabeat_Capstone/sleepDay_merged.csv")
Rows: 413 Columns: 5                                                                                                                                                          
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): SleepDay
dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

> hourly_steps <- read_csv("~/Documents/Bellabeat_Capstone/hourlySteps_merged.csv")
Rows: 22099 Columns: 3                                                                                                                                                        
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): ActivityHour
dbl (2): Id, StepTotal

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Prepare

To process the data sets, R programming language will be used. In this section the data will be cleaned and manipulated for analysis.

Identify and view data set using functions

head(daily_activity)

Console output

# A tibble: 6 × 15
          **Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveD…¹ Light…² Seden…³ VeryA…⁴ Fairl…⁵ Light…⁶ Seden…⁷ Calor…⁸**
       <dbl> <chr>             <dbl>         <dbl>           <dbl>                    <dbl>              <dbl>               <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 1503960366 4/12/2016         13162          8.5             8.5                         0               1.88               0.550    6.06       0      25      13     328     728    1985
2 1503960366 4/13/2016         10735          6.97            6.97                        0               1.57               0.690    4.71       0      21      19     217     776    1797
3 1503960366 4/14/2016         10460          6.74            6.74                        0               2.44               0.400    3.91       0      30      11     181    1218    1776
4 1503960366 4/15/2016          9762          6.28            6.28                        0               2.14               1.26     2.83       0      29      34     209     726    1745
5 1503960366 4/16/2016         12669          8.16            8.16                        0               2.71               0.410    5.04       0      36      10     221     773    1863
6 1503960366 4/17/2016          9705          6.48            6.48                        0               3.19               0.780    2.51       0      38      20     164     539    1728
# … with abbreviated variable names ¹ModeratelyActiveDistance, ²LightActiveDistance, ³SedentaryActiveDistance, ⁴VeryActiveMinutes, ⁵FairlyActiveMinutes, ⁶LightlyActiveMinutes,
#   ⁷SedentaryMinutes, ⁸Calories

str(daily_activity)

Console output

spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ ActivityDate            : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
 $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
 $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
 $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
 $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
 $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
 $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
 $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
 $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
 $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
 $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
 $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
 $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
 $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
 - attr(*, "spec")=
  .. cols(
  ..   Id = col_double(),
  ..   ActivityDate = col_character(),
  ..   TotalSteps = col_double(),
  ..   TotalDistance = col_double(),
  ..   TrackerDistance = col_double(),
  ..   LoggedActivitiesDistance = col_double(),
  ..   VeryActiveDistance = col_double(),
  ..   ModeratelyActiveDistance = col_double(),
  ..   LightActiveDistance = col_double(),
  ..   SedentaryActiveDistance = col_double(),
  ..   VeryActiveMinutes = col_double(),
  ..   FairlyActiveMinutes = col_double(),
  ..   LightlyActiveMinutes = col_double(),
  ..   SedentaryMinutes = col_double(),
  ..   Calories = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

head(daily_sleep)

Console output

# A tibble: 6 × 5
					**Id SleepDay              TotalSleepRecords TotalMinutesAsleep TotalTimeInBed**
			 <dbl> <chr>                             <dbl>              <dbl>          <dbl>
1 1503960366 4/12/2016 12:00:00 AM                 1                327            346
2 1503960366 4/13/2016 12:00:00 AM                 2                384            407
3 1503960366 4/15/2016 12:00:00 AM                 1                412            442
4 1503960366 4/16/2016 12:00:00 AM                 2                340            367
5 1503960366 4/17/2016 12:00:00 AM                 1                700            712
6 1503960366 4/19/2016 12:00:00 AM                 1                304            320

str(daily_sleep)

Console output

spc_tbl_ [413 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Id : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
$ SleepDay : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
$ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
$ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
$ TotalTimeInBed : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
- attr(*, "spec")=
.. cols(
.. Id = col_double(),
.. SleepDay = col_character(),
.. TotalSleepRecords = col_double(),
.. TotalMinutesAsleep = col_double(),
.. TotalTimeInBed = col_double()
.. )
- attr(*, "problems")=<externalptr>

head(hourly_steps)

Console output

# A tibble: 6 × 3
				  **Id ActivityHour          StepTotal**
			 <dbl> <chr>                     <dbl>
1 1503960366 4/12/2016 12:00:00 AM       373
2 1503960366 4/12/2016 1:00:00  AM       160
3 1503960366 4/12/2016 2:00:00  AM       151
4 1503960366 4/12/2016 3:00:00  AM         0
5 1503960366 4/12/2016 4:00:00  AM         0
6 1503960366 4/12/2016 5:00:00  AM         0

str(hourly_steps)

Console output

spc_tbl_ [22,099 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Id : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
$ ActivityHour: chr [1:22099] "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
$ StepTotal : num [1:22099] 373 160 151 0 0 ...
- attr(*, "spec")=
.. cols(
.. Id = col_double(),
.. ActivityHour = col_character(),
.. StepTotal = col_double()
.. )
- attr(*, "problems")=<externalptr>

Count number of participants in each data set

n_distinct(daily_activity$Id)

Console output

[1] 33

n_distinct(daily_sleep$Id)

Console output

[1] 24

n_distinct(hourly_steps$Id)

Console output

[1] 33

Data cleaning

The data sets need to be identified, sorted and filtered for any null values and duplicates. This was achieved by running the following code.

Check for duplicates

sum(duplicated(daily_activity))

Console output

[1] 0

sum(duplicated(daily_sleep))

Console output

[1] 3

sum(duplicated(hourly_steps))

Console output

[1] 0

Remove the duplicates

daily_sleep <- daily_sleep %>%
	distinct() %>%
	drop_na()

Verify removal of duplicates

sum(duplicated(daily_sleep))

Console output

[1] 0

Data manipulation

The data was further altered to improve organisation and easier readability.

Make the date and time consistent

daily_activity <- daily_activity %>%
  rename(date = ActivityDate) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))

head(daily_activity)

Console output

A tibble: 6 × 15
          **id date       totalsteps totaldistance trackerdistance loggedactivitiesdistance veryactivedistance moderatelyactivedist…¹ light…² seden…³ verya…⁴ fairl…⁵ light…⁶ seden…⁷ calor…⁸**
       <dbl> <date>          <dbl>         <dbl>           <dbl>                    <dbl>              <dbl>                  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 1503960366 2016-12-04      13162          8.5             8.5                         0               1.88                  0.550    6.06       0      25      13     328     728    1985
2 1503960366 2016-01-05      10602          6.81            6.81                        0               2.29                  1.60     2.92       0      33      35     246     730    1820
3 1503960366 2016-02-05      14727          9.71            9.71                        0               3.21                  0.570    5.92       0      41      15     277     798    2004
4 1503960366 2016-03-05      15103          9.66            9.66                        0               3.73                  1.05     4.88       0      50      24     254     816    1990
5 1503960366 2016-04-05      11100          7.15            7.15                        0               2.46                  0.870    3.82       0      36      22     203    1179    1819
6 1503960366 2016-05-05      14070          8.90            8.90                        0               2.92                  1.08     4.88       0      45      24     250     857    1959
# … with abbreviated variable names ¹moderatelyactivedistance, ²lightactivedistance, ³sedentaryactivedistance, ⁴veryactiveminutes, ⁵fairlyactiveminutes, ⁶lightlyactiveminutes,
#   ⁷sedentaryminutes, ⁸calories

daily_sleep <- daily_sleep %>%
  rename(date = SleepDay) %>%
  mutate(date = as_date(date,format = "%d/%m/%Y %S:%M:%I %p"  , tz=Sys.timezone()))

head(daily_sleep)

Console output

# A tibble: 6 × 5
          **id date       totalsleeprecords totalminutesasleep totaltimeinbed**
       <dbl> <date>                 <dbl>              <dbl>          <dbl>
1 1503960366 2016-04-12                 1                327            346
2 1503960366 2016-04-13                 2                384            407
3 1503960366 2016-04-15                 1                412            442
4 1503960366 2016-04-16                 2                340            367
5 1503960366 2016-04-17                 1                700            712
6 1503960366 2016-04-19                 1                304            320

hourly_steps <- hourly_steps %>%
  rename(date_time = ActivityHour) %>%
  mutate(date_time = as.POSIXct(date_time, format = "%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))

head(hourly_steps)

Console output

# A tibble: 6 × 3
          **id date_time           steptotal**
       <dbl> <dttm>                  <dbl>
1 1503960366 2016-04-12 00:00:00       373
2 1503960366 2016-04-12 01:00:00       160
3 1503960366 2016-04-12 02:00:00       151
4 1503960366 2016-04-12 03:00:00         0
5 1503960366 2016-04-12 04:00:00         0
6 1503960366 2016-04-12 05:00:00         0

Merge daily_activity and daily_sleep for any existing correlations

daily_activity_and_sleep <- merge(daily_activity, daily_sleep, by=c ("id", "date"))

glimpse(daily_activity_and_sleep)

Console output

Rows: 410
Columns: 18
$ Id                       <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366, 1503960366, 1503960366, 1503960366, 1503960366, 1503960366, 1503960366, 1503960366, 1503960…
$ date                     <date> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-16, 2016-04-17, 2016-04-19, 2016-04-20, 2016-04-21, 2016-04-23, 2016-04-24, 2016-04-25, 2016-04-26, 2016-0…
$ TotalSteps               <int> 13162, 10735, 9762, 12669, 9705, 15506, 10544, 9819, 14371, 10039, 15355, 13755, 13154, 11181, 14673, 10602, 14727, 15103, 14070, 12159, 11992, 10060, …
$ TotalDistance            <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.34, 9.04, 6.41, 9.80, 8.79, 8.53, 7.15, 9.25, 6.81, 9.71, 9.66, 8.90, 8.03, 7.71, 6.58, 7.72, 7.77, 8.13, 2…
$ TrackerDistance          <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.34, 9.04, 6.41, 9.80, 8.79, 8.53, 7.15, 9.25, 6.81, 9.71, 9.66, 8.90, 8.03, 7.71, 6.58, 7.72, 7.77, 8.13, 2…
$ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ VeryActiveDistance       <dbl> 1.88, 1.57, 2.14, 2.71, 3.19, 3.53, 1.96, 1.34, 2.81, 2.92, 5.29, 2.33, 3.54, 1.06, 3.56, 2.29, 3.21, 3.73, 2.92, 1.97, 2.46, 3.53, 3.45, 3.35, 2.56, 0…
$ ModeratelyActiveDistance <dbl> 0.55, 0.69, 1.26, 0.41, 0.78, 1.32, 0.48, 0.35, 0.87, 0.21, 0.57, 0.92, 1.16, 0.50, 1.42, 1.60, 0.57, 1.05, 1.08, 0.25, 2.12, 0.32, 0.53, 1.16, 1.01, 0…
$ LightActiveDistance      <dbl> 6.06, 4.71, 2.83, 5.04, 2.51, 5.03, 4.24, 4.65, 5.36, 3.28, 3.94, 5.54, 3.79, 5.58, 4.27, 2.92, 5.92, 4.88, 4.88, 5.81, 3.13, 2.73, 3.74, 3.26, 4.55, 2…
$ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ VeryActiveMinutes        <int> 25, 21, 29, 36, 38, 50, 28, 19, 41, 39, 73, 31, 48, 16, 52, 33, 41, 50, 45, 24, 37, 44, 46, 46, 36, 0, 9, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, …
$ FairlyActiveMinutes      <int> 13, 19, 34, 10, 20, 31, 12, 8, 21, 5, 14, 23, 28, 12, 34, 35, 15, 24, 24, 6, 46, 8, 11, 31, 23, 0, 71, 7, 0, 0, 0, 7, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, …
$ LightlyActiveMinutes     <int> 328, 217, 209, 221, 164, 264, 205, 211, 262, 238, 216, 279, 189, 243, 217, 246, 277, 254, 250, 289, 175, 203, 206, 214, 251, 120, 402, 148, 295, 176, 1…
$ SedentaryMinutes         <int> 728, 776, 726, 773, 539, 775, 818, 838, 732, 709, 814, 833, 782, 815, 712, 730, 798, 816, 857, 754, 833, 574, 835, 746, 669, 1193, 816, 682, 991, 527, …
$ Calories                 <int> 1985, 1797, 1745, 1863, 1728, 2035, 1786, 1775, 1949, 1788, 2013, 1970, 1898, 1837, 1947, 1820, 2004, 1990, 1959, 1896, 1821, 1740, 1819, 1859, 1783, 2…
$ TotalSleepRecords        <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ TotalMinutesAsleep       <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 277, 245, 366, 341, 404, 369, 277, 273, 247, 334, 331, 594, 338, 383, 285, 119, 124, 796, 137, 644, 7…
$ TotalTimeInBed           <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 323, 274, 393, 354, 425, 396, 309, 296, 264, 367, 349, 611, 342, 403, 306, 127, 142, 961, 154, 961, 9…

Analyse

This section will cover the identified trends and relationships by performing calculations. This will enable a descriptive analysis on the data set.

Calculate average daily steps by users

daily_average <- daily_activity_and_sleep %>%
  group_by(Id) %>%
  summarise(mean_daily_steps = mean(TotalSteps), mean_daily_calories = mean(Calories), mean_daily_sleep = mean(TotalMinutesAsleep))

head(daily_average)

Console output

# A tibble: 6 × 4
          **Id mean_daily_steps mean_daily_calories mean_daily_sleep**
       <dbl>            <dbl>               <dbl>            <dbl>
1 1503960366           12406.               1872.             360.
2 1644430081            7968.               2978.             294 
3 1844505072            3477                1676.             652 
4 1927972279            1490                2316.             417 
5 2026352035            5619.               1541.             506.
6 2320127002            5079                1804               61

Group users by average daily steps

usertype <- daily_average %>%
  mutate(usertype = case_when(
    mean_daily_steps < 5000 ~ "Sedentary",
    mean_daily_steps >= 5000 & mean_daily_steps < 7499 ~ "Lightly Active",
    mean_daily_steps >= 7500 & mean_daily_steps < 9999 ~ "Fairly Active",
    mean_daily_steps >= 10000 ~ "Very Active"))

head(usertype)

Console output

# A tibble: 6 × 5
          **Id mean_daily_steps mean_daily_calories mean_daily_sleep usertype**      
       <dbl>            <dbl>               <dbl>            <dbl> <chr>         
1 1503960366           12406.               1872.             360. Very Active   
2 1644430081            7968.               2978.             294  Fairly Active 
3 1844505072            3477                1676.             652  Sedentary     
4 1927972279            1490                2316.             417  Sedentary     
5 2026352035            5619.               1541.             506. Lightly Active
6 2320127002            5079                1804               61  Lightly Active

Calculate usertype percentage

Since adding a new column “usertype” a data frame will be created with the percentages of each user. Ultimately allowing for a better visualisation.

usertype_percent <- usertype %>%
  group_by(usertype) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(usertype) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales :: percent(total_percent))

usertype_percent$usertype <- factor(usertype_percent$usertype, levels = c("Very Active", "Fairly Active", "Lightly Active", "Sedentary")) 

head(usertype_percent)

Console output

# A tibble: 4 × 3
  **usertype       total_percent labels**
  <fct>                  <dbl>  <chr> 
1 Fairly Active          0.375    38%   
2 Lightly Active         0.208    21%   
3 Sedentary              0.208    21%   
4 Very Active            0.208    21%

This section will display all the information that has been discovered and will discuss the findings to help stakeholders make an accurate informed decision. To visualise this, R Studio was used to create diagrams and charts below.

Visualise usertype distribution

usertype_percent %>%
  ggplot(aes(x="", y=total_percent, fill=usertype)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  geom_text(aes(label = labels), position = position_stack(vjust = 0.5)) +
              labs(title = "Usertype Distribution") +
              theme_minimal()+
              theme(axis.title.x = element_blank(),
                    axis.title.y = element_blank(),
                    panel.border = element_blank(),
                    panel.grid = element_blank(),
                    axis.ticks = element_blank(),
                    axis.text.x = element_blank(),
                    plot.title = element_text(hjust = 0.5, size = 14, face = "bold")) +
              scale_fill_manual(values = c("#ff477e", "#e05780","#ff7aa2", "#ff9ebb"))

Console output bella1

Majority of the users are “Fairly Active.”

Calculate average steps walked and minutes slept by weekday

weekday_steps_and_sleep <- daily_activity_and_sleep %>%
  mutate(weekday = weekdays(date))

weekday_steps_and_sleep$weekday <- ordered(weekday_steps_and_sleep$weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

weekday_steps_and_sleep <- weekday_steps_and_sleep %>%
  group_by(weekday) %>%
  summarise(daily_steps = mean(TotalSteps), daily_sleep = mean(TotalMinutesAsleep))

head(weekday_steps_and_sleep, 7)

Console output

# A tibble: 7 × 3
  **weekday   daily_steps daily_sleep**
    <ord>         <dbl>       <dbl>
1 Monday           9273         420
2 Tuesday          9183         405
3 Wednesday        8023         435
4 Thursday         8184         401
5 Friday           7901         405
6 Saturday         9871         419
7 Sunday           7298         453

Visualise weekly daily steps

ggplot(weekday_steps_and_sleep) +
  geom_col(aes(weekday, daily_steps), fill = "#ff7aa2") +
  geom_hline(yintercept = 7500) +
  labs(title =  "Weekly daily steps", x = "", y = "") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 1)) +
  theme_bw()

Console output bella2

All users are walking more than the recommended steps of 7,500.
Saturday is the most active day with nearly 10,000 daily steps.
Sunday is the least active day with below the recommended daily steps.

Visualise Weekly minutes slept

ggplot(weekday_steps_and_sleep, aes(weekday, daily_sleep)) +
  geom_col(fill = "#ff7aa2") +
  geom_hline(yintercept = 405) +
  labs(title = "Weekly minutes slept", x = "", y = "") +
  theme(axis.title.x = element_text(angle = 45, vjust = 0.5, hjust = 1)) +
  theme_bw()

Console output bella3

Users are sleeping the recommended hours apart from Thursday.
Users are sleeping the most on Sundays.

Calculate hourly steps

hourly_steps <- hourly_steps %>%
  separate(date_time, into = c("date", "time"), sep = " ") %>%
  mutate(date = ymd(date))

head(hourly_steps)

Console output

# A tibble: 6 × 4
Id       date     time StepTotal
1 1503960366 2016-04-12 00:00:00       373
2 1503960366 2016-04-12 01:00:00       160
3 1503960366 2016-04-12 02:00:00       151
4 1503960366 2016-04-12 03:00:00         0
5 1503960366 2016-04-12 04:00:00         0
6 1503960366 2016-04-12 05:00:00         0

Visualise Hourly Steps of the Day

hourly_steps %>%
  group_by(time) %>%
  summarise(average_steps = mean(StepTotal)) %>%
  ggplot() +
  theme_bw() +
  geom_col(mapping = aes(x = time, y = average_steps, fill = average_steps)) +
  labs(title = "Hourly steps of the day", x = "", y = "") +
  scale_fill_gradient(low = "lightpink", high = "#ff477e") +
  theme(axis.title.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Console output bella4

Low steps from midnight until 6am as it is sleeping hours.
Steps gradually increase from 7am onwards.
Steps peak around 12pm until 2pm during lunch hours and again at 5pm until 7pm.

Daily Steps vs Minutes Slept correlation

ggplot(daily_activity_and_sleep, aes(x = TotalSteps, y = TotalMinutesAsleep)) +
  geom_jitter() +
  geom_smooth(colour = "#ff477e") +
  labs(title =  "Daily steps Vs Minutes slept", x = "Daily steps", y = "Minutes slept") +
  theme(panel.background = element_blank(), plot.title = element_text(size = 14))

Console output bella5

There is no correlation between daily steps and the minutes users sleep.
The amount of steps a user takes does not affect their sleeping pattern.

Daily Steps vs Calories correlation

ggplot(daily_activity_and_sleep, aes(x = TotalSteps, y = Calories)) +
  geom_jitter() +
  geom_smooth(colour = "#ff477e") +
  labs(title = "Daily steps Vs Calories", x = "Daily steps", y = "Calories") +
  theme(panel.background = element_blank(), plot.title = element_text(size = 14))

Console output bella6

There is a positive correlation between daily steps and the calories burnt.
The more steps users take, the more calories they will burn.

Calculate number of daily use

daily_use <- daily_activity_and_sleep %>%
  group_by(Id) %>%
  summarise(days_used = sum(n())) %>%
  mutate(usage = case_when(
    days_used >= 1 & days_used <= 10 ~ "Low use",
    days_used >= 11 & days_used <= 20 ~ "Moderate use",
    days_used >= 21 & days_used <= 31 ~ "High use",))

head(daily_use,7)

Console output

# A tibble: 7 × 3
          **Id days_used usage**       
       <dbl>     <int> <chr>       
1 1503960366        25 High use    
2 1644430081         4 Low use     
3 1844505072         3 Low use     
4 1927972279         5 Low use     
5 2026352035        28 High use    
6 2320127002         1 Low use     
7 2347167796        15 Moderate use

Calculate daily use percentage

daily_use_percent <- daily_use %>%
  group_by(usage) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(usage) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

daily_use_percent$usage <- factor(daily_use_percent$usage, levels = c ("High use", "Moderate use", "Low use"))

head(daily_use_percent)

Console output

# A tibble: 3 × 3
  **usage        total_percent labels**
  <fct>                <dbl>  <chr> 
1 High use             0.5      50%   
2 Low use              0.375    38%   
3 Moderate use         0.125    12%

Visualise daily use percentage

This will calculate the number of users using their smart devices.

daily_use_percent %>%
  ggplot(aes(x = "", y = total_percent, fill = usage)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(),
        panel.grid = element_blank(),
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size = 14, face = "bold")) +
  geom_text(aes(label = labels), position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = c ("#ff477e","#ff7aa2", "#ff9ebb"),
                    labels = c("High use - 21 to 31 days", "Moderate use - 11 to 20 days", "Low use - 1 to 10 days")) +
  labs(title = "Daily smart device use")

Console output bella7

Majority of the users are highly active on their smart devices and use it between 21 to 31 days.
12% of users use their smart devices between 11 to 20 days.
More than 30% of users rarely use their smart devices, ranging from 1 to 10 days.

Merge daily activity and daily use

This will determine how often users wear their smart devices.

daily_use_and_activity <- merge(daily_activity, daily_use, by = c("Id"))

head(daily_use_and_activity)

Console output

Id                 date TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
1 1503960366 2016-05-07      11992          7.71            7.71                        0               2.46                     2.12                3.13                       0
2 1503960366 2016-05-06      12159          8.03            8.03                        0               1.97                     0.25                5.81                       0
3 1503960366 2016-05-01      10602          6.81            6.81                        0               2.29                     1.60                2.92                       0
4 1503960366 2016-04-30      14673          9.25            9.25                        0               3.56                     1.42                4.27                       0
5 1503960366 2016-04-12      13162          8.50            8.50                        0               1.88                     0.55                6.06                       0
6 1503960366 2016-04-13      10735          6.97            6.97                        0               1.57                     0.69                4.71                       0
  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories days_used    usage
1                37                  46                  175              833     1821        25 High use
2                24                   6                  289              754     1896        25 High use
3                33                  35                  246              730     1820        25 High use
4                52                  34                  217              712     1947        25 High use
5                25                  13                  328              728     1985        25 High use
6                21                  19                  217              776     1797        25 High use

Calculate and group device usage in minutes

Here, another data frame was created to categorise the time period the users wear their smart devices.

minutes_device_worn <- daily_use_and_activity %>%
  mutate(total_minutes_wearing = VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes + SedentaryMinutes) %>%
  mutate(percent_minutes_wearing = (total_minutes_wearing / 1440) * 100) %>%
  mutate(wearing = case_when(
    percent_minutes_wearing == 100 ~ "All day",
    percent_minutes_wearing < 100 & percent_minutes_wearing >= 50 ~ "More than half day",
    percent_minutes_wearing < 50 & percent_minutes_wearing > 0 ~ "Less than half day"))

head(minutes_device_worn)

Console output

**Id                 date TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance**
1 1503960366 2016-05-07      11992          7.71            7.71                        0               2.46                     2.12                3.13                       0
2 1503960366 2016-05-06      12159          8.03            8.03                        0               1.97                     0.25                5.81                       0
3 1503960366 2016-05-01      10602          6.81            6.81                        0               2.29                     1.60                2.92                       0
4 1503960366 2016-04-30      14673          9.25            9.25                        0               3.56                     1.42                4.27                       0
5 1503960366 2016-04-12      13162          8.50            8.50                        0               1.88                     0.55                6.06                       0
6 1503960366 2016-04-13      10735          6.97            6.97                        0               1.57                     0.69                4.71                       0
  **VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories days_used    usage total_minutes_wearing percent_minutes_wearing            wearing**
1                37                  46                  175              833     1821        25 High use                  1091                75.76389 More than half day
2                24                   6                  289              754     1896        25 High use                  1073                74.51389 More than half day
3                33                  35                  246              730     1820        25 High use                  1044                72.50000 More than half day
4                52                  34                  217              712     1947        25 High use                  1015                70.48611 More than half day
5                25                  13                  328              728     1985        25 High use                  1094                75.97222 More than half day
6                21                  19                  217              776     1797        25 High use                  1033                71.73611 More than half day

Calculate device usage percentage

This data frame will illustrate the total number of users and the percentage of minutes they wore their devices for.

minutes_worn_percent <- minutes_device_worn %>%
  group_by(wearing) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(wearing) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales:: percent(total_percent))

head(minutes_worn_percent)

Console output

# A tibble: 3 × 3
  **wearing            total_percent labels**
  <chr>                      <dbl>  <chr> 
1 All day                   0.365     36%   
2 Less than half day        0.0351     4%    
3 More than half day        0.600     60%

The next calculations will create three more data frames which will be filtered by the daily users category. This will then be able to differentiate the daily use and time use.

Calculate device usage by “High use” filter

minutes_worn_high_use <- minutes_device_worn %>%
  filter(usage == "High use") %>%
  group_by(wearing) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(wearing) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

head(minutes_worn_high_use)

Console output

# A tibble: 3 × 3
  **wearing            total_percent labels**
  <chr>                      <dbl>  <chr> 
1 All day                   0.0676   6.8%  
2 Less than half day        0.0432   4.3%  
3 More than half day        0.889   88.9%

Calculate device usage by “Moderate use” filter

minutes_worn_mod_use <- minutes_device_worn %>%
  filter(usage == "Moderate use") %>%
  group_by(wearing) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(wearing) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

head(minutes_worn_mod_use)

Console output

# A tibble: 3 × 3
  **wearing            total_percent labels**
  <chr>                      <dbl> <chr> 
1 All day                    0.267 27%   
2 Less than half day         0.04  4%    
3 More than half day         0.693 69%

Calculate device usage by “Low use” filter

minutes_worn_low_use <- minutes_device_worn %>%
  filter(usage == "Low use") %>%
  group_by(wearing) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(wearing) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

head(minutes_worn_low_use)

Console output

# A tibble: 3 × 3
  **wearing            total_percent labels**
  <chr>                      <dbl>  <chr> 
1 All day                   0.802     80%   
2 Less than half day        0.0224     2%    
3 More than half day        0.175     18%

Visualise time worn per day

ggplot(minutes_worn_percent, aes(x="",y=total_percent, fill=wearing)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  theme_minimal() +
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5)) +
  scale_fill_manual(values = c("#ff477e","#ff7aa2", "#ff9ebb")) +
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5), size = 3.5) +
  labs(title="Time worn per day", subtitle = "Total Users")

Console output bella8

More than 30% of users wear their devices “All day.”
4% of users wear their devices “Less than half a day.”
Majority of the users wear their devices “More than half a day.”

Visualise time worn per day by “High use” users

ggplot(minutes_worn_high_use, aes(x="",y=total_percent, fill=wearing)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  theme_minimal() +
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5), 
        legend.position = "none") +
  scale_fill_manual(values = c("#ff477e","#ff7aa2", "#ff9ebb")) +
  geom_text_repel(aes(label = labels),
                  position = position_stack(vjust = 0.5), size = 3) +
  labs(title="Time worn per day", subtitle = "High use users")

Console output ![bella9](/bella9.png “Time worn per day - High use users)

6.8% of users wore and used their devices “All day”.
88.9% of users wore and used their devices “More than half a day.”
4.3% of users wore and used their devices “Less than half a day.”

Visualise time worn per day by “Moderate use” users

ggplot(minutes_worn_mod_use, aes(x="",y=total_percent, fill=wearing)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  theme_minimal() +
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size = 14, face = "bold"), 
        plot.subtitle = element_text(hjust = 0.5),
        legend.position = "none") +
  scale_fill_manual(values = c("#ff477e","#ff7aa2", "#ff9ebb")) +
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5), size = 3) +
  labs(title="Time worn per day", subtitle = "Moderate use users")

Console output ![bella10](/bella10.png “Time worn per day - Moderate use users)

27% of users used their smart devices for the entire day.
69% of users used their smart devices for “More than half a day.”
4% of users rarely use their smart devices.

Visualise time worn per day by “Low use” users

ggplot(minutes_worn_low_use, aes(x="",y=total_percent, fill=wearing)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  theme_minimal() +
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size = 14, face = "bold"), 
        plot.subtitle = element_text(hjust = 0.5),
        legend.position = "none") +
  scale_fill_manual(values = c("#ff477e","#ff7aa2", "#ff9ebb")) +
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5), size = 3) +
  labs(title = "Time worn per day", subtitle = "Low use users")

Console output ![bella11](/bella11.png “Time worn per day - low use users)

80% of users use their smart devices for the entire day.
18% of users use their smart devices for “More than half a day.”
2% of users rarely use their smart devices.

Act

This section will cover the recommendations based on the information and findings discovered. Bellabeat needs to focus on tracking their own data to respond and assist to their mission and business task.

In order to come up with the correct marketing strategy for it’s products Bellabeat could create the following:

Calorie counter: A counter could encourage a user to use the app and smart device a lot more with an appealing designed interface which displays the number of calories that are burnt during use. To further improve user motivation, customisable features could be added to allow users to be in control of their apps and create “goals”.
Sleep log: Although, the sampled users have consistent sleeping patterns, some users may want to track their sleeping habits. This feature could boost Bellabeat app use a lot, as it would enable to recording of number of wakes during the night, sleep quality, and total awake time in bed. With this data, the app could recommend the user a certain time to be in bed by and this could be customisable by the user.
Push notifications: As discovered from the findings, some users are not walking the recommended steps and not sleeping the recommended hours. In order to combat this, Bellabeat could push notifications to users if they are low in steps for the day, close to their “goals” or if it’s nearing their bed time. This would motivate users to make more use of their apps and smart devices.
Daily, weekly and monthly achievements: The Bellabeat app could provide users with insightful and interactive reports based on their steps, calories burnt, and sleeping habits. Badges and messages could be sent to users to motivate them and encourage them to continue with their good habits.
Daily health words of wisdom and improvements: Bellabeat could push daily words of wisdom on health in the app to motivate and encourage their users to use the app or their smart devices. This can also include tips or improvements on the users overall performance to get them to achieve their “goals”.
Discounts on other products: Special discounts on different Bellabeat products can be offered to users in order to retain and keep them motivated. This will act as an incentive and incline users to purchase more products and stay active. Moreover, discounts can be given on Bellabeat’s premium membership as well.

No further recommendation was made to Bellabeat as the data set could have been biased as it was limited due to having a very small sample size and a lack in user demographic information such as age. Having this information available could have been extremely useful in gaining movement insights and trends of women in different age groups.