SQQS2073 (A) REGRESSION MODELING
SECOND SEMESTER SESSION 2022/2023 (A222)

GROUP PROJECT

TITLE: A STUDY OF THE RELATIONSHIP OF FACTORS AFFECTING QUALITY OF SLEEP

PREPARED FOR: DR. KAMAL BIN KHALID

PREPARED BY:

NO.  MATRIC NO.  NAME
1    284286      NUR AIN BINTI ABDULLAH
2    284420      NUR SHAHIEFFA ZETTY BT MUHAMMAD SABRI
3    287932      NUR ANIS SYAZLIANA BINTI SHAFIDAN
4    288057      AMIRA ATIKAH BINTI AHMAD ZULKARNAIN
5    288221      KHAIRINA BINTI KHAIRUDDIN
UUM | SQQS2073 (A) REGRESSION MODELLING (SESSION A222)

A Study of The Relationship of Factors Affecting Quality of Sleep

Nur Ain binti Abdullah (284286)¹, Nur Shahieffa Zetty binti Muhd Sabri (284420)², Nur Anis Syazliana binti Shafidan (287932)³, Amira Atikah binti Ahmad Zulkarnain (288057)⁴, Khairina binti Khairuddin (288221)⁵

¹,²,³,⁴,⁵ School of Quantitative Sciences, Universiti Utara Malaysia, 06010 UUM Sintok, Kedah

*Email: [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

This study examines the relationship between an individual's quality of sleep and the following factors: age, physical activity, daily steps, heart rate and sleep duration. We build a multiple linear regression model using 374 samples to assess how these factors relate to quality of sleep. The model includes 5 independent variables (age, physical activity, daily steps, heart rate and sleep duration) and 1 dependent variable (quality of sleep). Our findings demonstrate that all independent variables are significant. Therefore, we conclude that quality of sleep depends on age, physical activity, daily steps, heart rate and sleep duration. This study can help to estimate quality of sleep based on the factors affecting it.

Keywords

Factors, Quality of sleep, Regression

1. Introduction

1.1 Backgrounds of Study

Quality of sleep is a fundamental aspect of overall well-being, and its significance in maintaining optimal health cannot be overstated. This research endeavours to shed light on the various factors that affect quality of sleep, aiming to provide valuable insights into understanding and managing this essential physiological measure. The main objective of this study is to investigate the relationship between the different factors that contribute to the quality of sleep experienced by individuals.
The main factors examined are age, physical activity, daily steps, heart rate and sleep duration. By studying these factors, the research aims to uncover how they influence quality of sleep and interact with each other. The data were obtained from the Kaggle website, where a total of 374 samples were available.

1.2 Objectives of the Study

The general objective of this project is to identify the factors that affect the quality of sleep experienced by individuals. The specific objectives of the project are as follows:

1. To investigate the relationship between age and quality of sleep.
2. To investigate the relationship between physical activity and quality of sleep.
3. To investigate the relationship between daily steps and quality of sleep.
4. To investigate the relationship between heart rate and quality of sleep.
5. To investigate the relationship between sleep duration and quality of sleep.

1.3 Scope of the Study

This study seeks to explore the factors that affect quality of sleep in order to enhance our understanding of sleep patterns and overall sleep health. The findings will provide valuable insights to individuals, enabling them to make informed decisions and adopt healthier lifestyle choices to optimise their quality of sleep and their overall well-being.

2. Methodology

2.1 Introduction

In general, data collection is the process of obtaining information or observations from many sources in order to address the research question. Primary and secondary data are the two types of data collected. Primary data is obtained directly from the source via experiments, surveys, or observations, whereas secondary data is collected from a previously available source and the information is presented clearly, after which the data is
analysed. The research strategy used in this study was quantitative. The information we used came from the Kaggle website. The case study focuses on factors affecting quality of sleep. It is a survey-style study in which respondents answer questions using a questionnaire.

2.2 Data Collection

This report is a study of the relationship of factors affecting quality of sleep. The dataset that we obtained contains 374 respondents. The data that we used is secondary data retrieved from the Kaggle website (https://www.kaggle.com/datasets/uom190346a/sleep-health-and-lifestyle-dataset?resource=download). The dependent variable of the study is Quality of Sleep ($Y$), while the independent variables are Sleep Duration ($X_1$), Heart Rate ($X_2$), Age ($X_3$), Physical Activity ($X_4$) and Daily Steps ($X_5$). The measurements for quality of sleep, sleep duration, heart rate, physical activity and daily steps are interval scale data, while the measurements for age are ratio scale data.

2.3 Method of Data Analysis

Linear regression, or Ordinary Least Squares (OLS) regression, is applied in the data analysis of our study. Linear regression determines the linear relationship between two variables using a best-fit line. It is thus represented graphically by a straight line, with the slope specifying how a change in one variable affects the other. The y-intercept of a linear regression relationship represents the value of one variable when the value of the other is zero. When there is only one explanatory variable, the regression model is called simple linear regression. In this study, a multiple linear regression model is applied since there is one dependent variable and five independent variables. The multiple regression model can be written as follows:

$Y = B_0 + B_1 X_1 + B_2 X_2 + B_3 X_3 + B_4 X_4 + B_5 X_5 + \varepsilon$    (1)

Where:
$Y$ = quality of sleep
$B_0$ = expected value of $Y$ when all independent variables are equal to zero
$B_i$, where $i = 1, 2, 3, 4, 5$ = coefficient of each independent variable $X_i$
$\varepsilon$ = random error term

Assumptions

Generally, there are 5 main assumptions of a multiple regression:

1. Linearity: There is a linear relationship between the independent variables ($X$) and the mean of the dependent variable ($Y$).
2. Normality: The residuals or errors are normally distributed with mean equal to zero.
3. Homoscedasticity: The residuals or errors have constant variance, which is the same for all values of the independent variables ($X$).
4. No Multicollinearity: The independent variables ($X$) should not be highly correlated with each other.
5. Independence / autocorrelation: The residuals are independent of each other.

In regression modelling, checking these assumptions supports the objectives of our study because it lets us evaluate the overall fit of the model and determine the strength of the relationship between the independent variables and the dependent variable.

2.4 Procedure

Software

For this group project we used the Statistical Package for the Social Sciences (SPSS) for data analysis. SPSS is widely recognised as a highly potent and extensively used statistical program (Mayers, 2013, 11). It comprises four programs designed to assist users in analysing intricate datasets: a statistics program, a modeller program, a text analytics for surveys program, and a visualisation designer. We transferred the data from Excel to SPSS, where it was analysed using the Linear Regression Analysis feature within the statistics program.

Model Specification and Variable Selection

To choose the best model, we used stepwise regression as the technique of variable selection. The stepwise regression method combines elements of both backward elimination and forward selection. It repeatedly adds or removes variables from the list of independent variables.
It begins like forward selection by examining all possible explanatory variables in simple regressions and choosing the one with the largest partial F statistic. A hypothesis test for significance is performed, and if the variable is judged important, it is added to the model. Each of the remaining variables is then examined. The variable with the largest partial F statistic is chosen, and a hypothesis test for significance is performed on its coefficient to determine whether it should be added to the model. If the variable is judged important, it is added as in the forward selection procedure.
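The partial F statistic that drives each entry decision compares the residual sum of squares of the model with and without the candidate variable. The following is a minimal sketch of that calculation, assuming NumPy is available; the function names `sse` and `partial_f` and the simulated data are illustrative only and are not taken from the study's SPSS output.

```python
import numpy as np


def sse(X, y):
    """Return (residual sum of squares, number of parameters) for an OLS fit.

    X is an (n, p) array of predictors; an intercept column is added here.
    """
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid), design.shape[1]


def partial_f(X_in, x_new, y):
    """Partial F statistic for adding one candidate variable x_new
    to a model that already contains the columns of X_in."""
    sse_reduced, _ = sse(X_in, y)
    sse_full, p_full = sse(np.column_stack([X_in, x_new]), y)
    # One extra parameter in the numerator; n - p_full error degrees of freedom.
    return (sse_reduced - sse_full) / (sse_full / (len(y) - p_full))


# Synthetic illustration (assumed data): x1 drives y, x2 is pure noise.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 + rng.normal(size=n)

empty = np.empty((n, 0))          # model with intercept only
f_x1 = partial_f(empty, x1, y)    # candidate that matters
f_x2 = partial_f(empty, x2, y)    # candidate that does not
print(f"partial F for x1: {f_x1:.1f}, for x2: {f_x2:.1f}")
```

In this sketch the first step would pick `x1`, since it has the larger partial F; a full stepwise routine would then compare that statistic with the entry threshold before admitting the variable.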
At this point, however, the stepwise procedure begins to act like the backward elimination procedure. After adding a new variable to the model, the stepwise procedure retests the coefficients of the previously added variables, deleting a variable if the test judges it to be unnecessary and retaining it otherwise. As the addition of one variable can result in a change in the partial F statistic associated with another variable, it is possible for the stepwise procedure to allow a variable to enter the equation at one step, delete it at a later step, and allow it to re-enter at an even later step. Once none of the remaining out-of-equation variables tests as significant and all of the variables in the equation are judged to be necessary, the stepwise regression procedure terminates. Stepwise regression requires two significance levels: one for adding variables and one for removing variables.

Assumption Testing

To check the assumptions behind the linear relationship between the dependent variable and the independent variables, multicollinearity is checked using the Variance Inflation Factor (VIF), while autocorrelation is checked using the Durbin-Watson test statistic.

Hypothesis Testing

For hypothesis testing, we used the analysis of variance (ANOVA) F-test to determine whether the regression model is significant. We used a p-value approach to assess the significance of the model and of each coefficient.

3. Data Analysis and Findings

To validate the model, assumption testing is carried out at the outset of the analysis. If the data do not satisfy an assumption, a transformation is required.

Assumption 1 : Linear Function and Assumption 3 : Constant Variance

Figure 1: Scatter Plot of Residual

Scatter plots of residuals against predicted values are used to check assumptions 1 and 3.
Figure 1 shows that the residuals are unbiased, so assumption 1 is fulfilled. The residuals also show constant variance (homoscedasticity), so assumption 3 is fulfilled. In conclusion, the data support the assumptions that the variance is constant and the mean of the residuals is zero.

Assumption 2 : Normally Distributed

Figure 2.0: Histogram Normal Distribution

Figure 2.1: Normal P-P Plot

Figures 2.0 and 2.1 illustrate that the residuals are approximately normal; the normal probability plot also shows that the residuals are normally distributed. Therefore, assumption 2 is fulfilled, and we conclude that the residuals follow a normal distribution with mean zero.

Assumption 4 : No relationship between Independent Variables.

Checking for Multicollinearity :

Figure 3.1: Correlation Matrix

First, we generate the correlation matrix. Based on the correlation matrix in Figure 3.1, we can see that none of the correlations between independent variables is at
0.9 or above. This indicates that there is no multicollinearity problem among the independent variables. From the table, we can also see that heart rate has a negative relationship with average daily steps, sleep duration and age, as does average daily steps with sleep duration. Then, we generate the following output to assess multicollinearity:

Figure 3.2: Model Summary Multicollinearity

Figure 3.3: ANOVA Table

Figure 3.4: Coefficients Collinearity Statistics

Multicollinearity problems can arise when we perform individual tests on the independent variables. The Variance Inflation Factor (VIF) can be used to measure multicollinearity. It measures how much the variance of an estimated coefficient is inflated by the linear relationship between that independent variable and the others. Figure 3.4, the Coefficients Collinearity Statistics, shows that the Tolerance value of each variable is greater than 0.1 and the VIF value of each variable is less than 10. VIF values in this range indicate low correlation among the independent variables, which generally does not result in serious deterioration of the quality of the least-squares estimates. All of the VIF values are around 1, which means that multicollinearity is not a problem: the independent variables are only weakly related to one another.

Assumption 5 : Independence of Residuals.

Checking Autocorrelation

In this section, we check for autocorrelation. Autocorrelation exists when successive observations over time are related to one another. It can occur because the effect of an independent variable on the response is distributed over time. We use the Durbin-Watson test (DW test) to determine serial correlation.
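The Durbin-Watson statistic is computed from the ordered residuals e_1, ..., e_n as d = Σ(e_t − e_{t−1})² / Σ e_t², where the numerator runs over t = 2, ..., n. It lies between 0 and 4: values near 2 suggest no serial correlation, values towards 0 suggest positive serial correlation, and values towards 4 suggest negative serial correlation. A minimal sketch of the calculation is given here; the function name `durbin_watson` and the two residual sequences are made up for illustration and are not the residuals of this study.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals that keep the same sign for long runs
# (positive serial correlation): d falls well below 2.
d_positive = durbin_watson([1, 1, 1, -1, -1, -1])

# Hypothetical residuals that flip sign at every step
# (negative serial correlation): d rises above 2.
d_negative = durbin_watson([1, -1, 1, -1])

print(d_positive, d_negative)
```

The statistic is then compared with the tabulated lower and upper critical values for the given sample size and number of independent variables.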
First, we find the DW test statistic using SPSS; the results are shown below:

Figure 4.1: Model Summary Durbin Watson

Based on Figure 4.1, the value of the DW statistic is 0.829. As the value of DW approaches 0, positive serial correlation becomes more severe. We therefore test for positive autocorrelation. The hypotheses to be tested are written as follows:

Step 1 : Hypotheses
H0 : ρ = 0 (no residual correlation)
H1 : ρ > 0 (positive residual correlation)

Step 2 : Test Statistic
d = 0.829

Step 3 : Critical Value
From Durbin-Watson Figure 2.7, to test for positive autocorrelation, the following decision rule is used: reject H0 if d < dL, do not reject H0 if d > dU, and the test is inconclusive if dL ≤ d ≤ dU.

Figure 4.2: Residual Statistics

dL = -1.248
dU = 1.678

Step 4 : Decision
d ≥ dL = -1.248
d ≤ dU = 1.678

Step 5 : Conclusion
From the Figure 4.2 residual statistics, dL and dU are critical values that depend on the sample size and the number of independent variables. In our study, dL is -1.248 and dU is 1.678. Because d = 0.829 lies in the inconclusive range dL = -1.248 ≤ d ≤ dU = 1.678, there is insufficient evidence to reject H0. Therefore, positive autocorrelation cannot be confirmed and more information is needed.

Analysis of the Overall Model

Figure 5.1: Model Summary of Overall Model

Figure 5.2: Coefficient Table of Overall Model

Based on the result from the stepwise method shown in Figure 5.2, there are five possible regression models:

1. $\hat{Y} = -2.163 + 1.329X_1$    (2)
2. $\hat{Y} = 5.018 + 1.113X_1 - 0.080X_2$    (3)
3. $\hat{Y} = 4.343 + 1.030X_1 - 0.077X_2 + 0.025X_3$    (4)
4. $\hat{Y} = 4.837 + 0.998X_1 - 0.083X_2 + 0.023X_3 + 0.004X_4$    (5)
5. $\hat{Y} = 6.087 + 0.944X_1 - 0.093X_2 + 0.023X_3 + 0.008X_4 - 6.405 \times 10^{-5}X_5$    (6)

Note that: $\hat{Y}$ = predicted Quality of Sleep, $X_1$ = Sleep Duration, $X_2$ = Heart Rate, $X_3$ = Age, $X_4$ = Physical Activity, $X_5$ = Daily Steps

Of the five models, the best is Regression Model 5, as it has the highest coefficient of determination. As shown in Figure 5.1, $R^2 = 0.870$, which can be interpreted as 87.0% of the variance in Quality of Sleep being explained by the independent variables Sleep Duration, Heart Rate, Age, Physical Activity and Daily Steps. The multiple correlation coefficient, $R = 0.933$, indicates a strong positive linear relationship between Quality of Sleep and the independent variables Sleep Duration, Heart Rate, Age, Physical Activity and Daily Steps.

Significance Test for Regression Model

Figure 6: ANOVA Table of Overall Model

Significance test for overall model

H0 : B1 = B2 = B3 = B4 = B5 = 0
H1 : At least one coefficient is not equal to zero
P-value = 0.001, α = 0.05
P-value < α (Reject H0)
Thus, the overall regression model is significant.

Significance Test for Individual Regression Coefficients

Sleep Duration
H0 : B1 = 0
H1 : B1 ≠ 0
P-value = 0.001, α = 0.05
P-value < α (Reject H0)
Thus, sleep duration is significant.

Heart Rate
H0 : B2 = 0
H1 : B2 ≠ 0
P-value = 0.001, α = 0.05
P-value < α (Reject H0)
Thus, heart rate is significant.

Age
H0 : B3 = 0
H1 : B3 ≠ 0
P-value = 0.001, α = 0.05
P-value < α (Reject H0)
Thus, age is significant.

Physical Activity
H0 : B4 = 0
H1 : B4 ≠ 0
P-value = 0.001, α = 0.05
P-value < α (Reject H0)
Thus, physical activity is significant.

Daily Steps
H0 : B5 = 0
H1 : B5 ≠ 0
P-value = 0.001, α = 0.05
P-value < α (Reject H0)
Thus, daily steps is significant.

The Best Model

Figure 7.1: Model Summary of Best Model

Figure 7.2: ANOVA Table of Best Model

Significance test for Best Model

H0 : B1 = B2 = B3 = B4 = B5 = 0
H1 : At least one coefficient is not equal to zero
P-value = 0.001, α = 0.05
P-value < α (Reject H0)
Thus, the best regression model is significant overall.
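The F statistic behind an overall significance test of this kind can be recovered from the coefficient of determination as F = (R² / k) / ((1 − R²) / (n − k − 1)), where n is the sample size and k is the number of independent variables. The sketch below applies this to the values reported in this study (R² = 0.870, n = 374, k = 5). Because R² is rounded to three decimals, the result only approximates the F value in the SPSS ANOVA table, and the function name `overall_f` is illustrative.

```python
def overall_f(r_squared, n, k):
    """Overall F statistic of a regression with k predictors and n observations,
    computed from the coefficient of determination."""
    explained_per_df = r_squared / k
    unexplained_per_df = (1.0 - r_squared) / (n - k - 1)
    return explained_per_df / unexplained_per_df


# Values reported in the text: R^2 = 0.870, 374 respondents, 5 predictors.
f_stat = overall_f(0.870, n=374, k=5)
print(f"Approximate overall F statistic: {f_stat:.2f}")
```

An F statistic this large on 5 and 368 degrees of freedom is consistent with the very small P-value reported for the overall model.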
Figure 8: Coefficient Table of Best Model

Based on the result from the stepwise method, the best regression model is the fifth model, as shown in Figure 8, since it has the highest coefficient of determination, $R^2$. The model is:

$\hat{Y} = 6.087 + 0.944X_1 - 0.093X_2 + 0.023X_3 + 0.008X_4 - 6.405 \times 10^{-5}X_5$    (7)

All of the indicator variables are expected to be significant based on the output, as there is no excluded variable. The following hypothesis tests were carried out to determine the significance of each indicator variable on the dependent variable at the 1% significance level (α = 0.01).

Sleep Duration
H0 : B1 = 0
H1 : B1 ≠ 0
P-value = 0.001, α = 0.01
P-value < α (Reject H0)
Thus, sleep duration is significant.

Heart Rate
H0 : B2 = 0
H1 : B2 ≠ 0
P-value = 0.001, α = 0.01
P-value < α (Reject H0)
Thus, heart rate is significant.

Age
H0 : B3 = 0
H1 : B3 ≠ 0
P-value = 0.001, α = 0.01
P-value < α (Reject H0)
Thus, age is significant.

Physical Activity
H0 : B4 = 0
H1 : B4 ≠ 0
P-value = 0.001, α = 0.01
P-value < α (Reject H0)
Thus, physical activity is significant.

Daily Steps
H0 : B5 = 0
H1 : B5 ≠ 0
P-value = 0.019, α = 0.01
P-value > α (Fail to reject H0)
Thus, daily steps is not significant at the 1% significance level, although it is significant at the 5% level (0.019 < 0.05).

4. Conclusion

As mentioned before, the main objective of this study is to investigate the relationship between different factors that contribute to the quality of sleep experienced by individuals. The main factors examined in this study are age, physical activity, daily steps, heart rate and sleep duration, which were set up as the independent variables. We used several procedures, namely SPSS software, regression assumption checks and hypothesis testing, to identify the relationship and thereby obtain the best regression model. From the analysis, we found that the assumptions required for this multiple linear regression model were met, with the exception of the Durbin-Watson test for autocorrelation, which was inconclusive.
By applying the stepwise method, several candidate models were identified, and the best regression model was chosen as the one whose coefficient of determination is closest to 1, which is Equation (7). Then, the overall best model and all the remaining
explanatory variables in the model were found to be significant at the 5% significance level by conducting hypothesis testing, and all except daily steps (P-value = 0.019) were also significant at the 1% level. In conclusion, the regression model that we obtained is significant: there is a relationship between the dependent variable and each independent variable, while there is no strong relationship among the independent variables themselves.

Acknowledgements

We would like to express our sincere gratitude and appreciation to all those who have contributed to the successful completion of this study on the relationship of factors affecting the quality of sleep. First and foremost, we extend our heartfelt thanks to our lecturer, Dr. Kamal Bin Khalid, for his invaluable guidance, unwavering support, and continuous encouragement throughout the study process. His expertise and insights have been instrumental in shaping the direction and focus of this study.

Furthermore, we would like to acknowledge the authors and researchers whose work has laid the foundation for this research. Their contributions have been pivotal in shaping the theoretical framework and establishing the context for this study.

Last but not least, we are grateful to our family, friends, and colleagues for their continuous support, understanding, and motivation throughout this endeavour. Their encouragement and belief in our abilities have been a constant source of inspiration.

Without the collective efforts and support of these individuals and organisations, this study would not have been possible. We are truly grateful for their contributions, and we acknowledge their immense impact on the successful completion of this research.