{"id":255,"date":"2023-06-12T21:20:46","date_gmt":"2023-06-12T20:20:46","guid":{"rendered":"https:\/\/zdebla.me\/?p=255"},"modified":"2023-06-12T21:20:46","modified_gmt":"2023-06-12T20:20:46","slug":"simple-linear-regression-how-dataset-sample-influences-the-outcome","status":"publish","type":"post","link":"https:\/\/zdebla.me\/index.php\/2023\/06\/12\/simple-linear-regression-how-dataset-sample-influences-the-outcome\/","title":{"rendered":"Simple linear regression &#8211; how dataset sample influences the outcome"},"content":{"rendered":"\n<p>First I wanted to create a simple linear regression, plotting the number of flights and the number of delays in the SES area (RP1-RP2), for the period of years 2016-2019. I extracted the data from the <a href=\"https:\/\/ansperformance.eu\/data\/\">EUROCONTROL datasets<\/a> &#8211; <em><strong>Daily <\/strong>IFR traffic and en-route ATFM delay by entity and delay cause (FIR based)<\/em> and <em><strong>Monthly <\/strong>IFR traffic and en-route <a href=\"https:\/\/ansperformance.eu\/definition\/atfm-delay\/\">ATFM delay<\/a> by entity and delay cause (FIR based)<\/em>.<\/p>\n\n\n\n<p>The linear regression in monthly data was interesting, with an R-squared value of 0.6932.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/zdebla.me\/wp-content\/uploads\/2023\/06\/obrazok-27.png\" alt=\"\" class=\"wp-image-257\" width=\"658\" height=\"395\" srcset=\"https:\/\/zdebla.me\/wp-content\/uploads\/2023\/06\/obrazok-27.png 480w, https:\/\/zdebla.me\/wp-content\/uploads\/2023\/06\/obrazok-27-300x180.png 300w\" sizes=\"auto, (max-width: 658px) 100vw, 658px\" \/><\/figure>\n\n\n\n<p>However, when I plotted the daily data, the R-squared value was just 0.2742.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/zdebla.me\/wp-content\/uploads\/2023\/06\/obrazok-28.png\" alt=\"\" 
class=\"wp-image-258\" width=\"650\" height=\"390\" srcset=\"https:\/\/zdebla.me\/wp-content\/uploads\/2023\/06\/obrazok-28.png 480w, https:\/\/zdebla.me\/wp-content\/uploads\/2023\/06\/obrazok-28-300x180.png 300w\" sizes=\"auto, (max-width: 650px) 100vw, 650px\" \/><\/figure>\n\n\n\n<p>The difference in R-squared values between the monthly and daily data <strong>suggests that the frequency or granularity of the data can have a notable impact on the performance of linear regression models.<\/strong><\/p>\n\n\n\n<p>To verify that the two datasets match, I aggregated the daily data in a pivot table and then compared the results:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"390\" height=\"820\" src=\"https:\/\/zdebla.me\/wp-content\/uploads\/2023\/06\/obrazok-29.png\" alt=\"\" class=\"wp-image-259\" srcset=\"https:\/\/zdebla.me\/wp-content\/uploads\/2023\/06\/obrazok-29.png 390w, https:\/\/zdebla.me\/wp-content\/uploads\/2023\/06\/obrazok-29-143x300.png 143w\" sizes=\"auto, (max-width: 390px) 100vw, 390px\" \/><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>Bonus: Here I want to compliment how easy it is to add a colour element to the scatter chart in Tableau, with just a few clicks. In Excel, doing something like this would be quite elaborate. 
<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/zdebla.me\/wp-content\/uploads\/2023\/06\/obrazok-26-1024x901.png\" alt=\"\" class=\"wp-image-256\" width=\"659\" height=\"579\" srcset=\"https:\/\/zdebla.me\/wp-content\/uploads\/2023\/06\/obrazok-26-1024x901.png 1024w, https:\/\/zdebla.me\/wp-content\/uploads\/2023\/06\/obrazok-26-300x264.png 300w, https:\/\/zdebla.me\/wp-content\/uploads\/2023\/06\/obrazok-26-768x676.png 768w, https:\/\/zdebla.me\/wp-content\/uploads\/2023\/06\/obrazok-26.png 1329w\" sizes=\"auto, (max-width: 659px) 100vw, 659px\" \/><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>First I wanted to create a simple linear regression, plotting the number of flights and the number of delays in the SES area (RP1-RP2), for the period of years 2016-2019. I extracted the data from the EUROCONTROL datasets &#8211; Daily IFR traffic and en-route ATFM delay by entity and delay cause (FIR based) and Monthly 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11],"tags":[],"class_list":["post-255","post","type-post","status-publish","format-standard","hentry","category-articles"],"_links":{"self":[{"href":"https:\/\/zdebla.me\/index.php\/wp-json\/wp\/v2\/posts\/255","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/zdebla.me\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/zdebla.me\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/zdebla.me\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/zdebla.me\/index.php\/wp-json\/wp\/v2\/comments?post=255"}],"version-history":[{"count":1,"href":"https:\/\/zdebla.me\/index.php\/wp-json\/wp\/v2\/posts\/255\/revisions"}],"predecessor-version":[{"id":260,"href":"https:\/\/zdebla.me\/index.php\/wp-json\/wp\/v2\/posts\/255\/revisions\/260"}],"wp:attachment":[{"href":"https:\/\/zdebla.me\/index.php\/wp-json\/wp\/v2\/media?parent=255"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/zdebla.me\/index.php\/wp-json\/wp\/v2\/categories?post=255"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/zdebla.me\/index.php\/wp-json\/wp\/v2\/tags?post=255"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}