background-image: url("img/logo_padded.001.jpeg") background-position: left background-size: 60% class: middle, center, .pull-right[ <br> ## .base_color[Database Querying with SQL] <br> <br> #### .navy[Kelly McConville] #### .navy[ Stat 108 | Week 13 | Spring 2023] ] --- background-image: url("img/ggparty_s23.001.jpeg") background-size: 80% class: bottom, center, ### If you are able to attend, please RSVP: [https://bit.ly/ggpartys23](https://bit.ly/ggpartys23) --- ### Announcements * Extra credit lecture quiz on Gradescope * Updated [OH Schedule](https://docs.google.com/spreadsheets/d/1HqEmr4tEtFPWRrF5TJHd1VgtVD030w6aAYySoFTnhBw/edit?usp=sharing) * [Final Presentations + Food](https://docs.google.com/spreadsheets/d/1xU3w4sXQSWkU678YjtdnjyBZOuB6RtFiZeB4fbsZozo/edit?usp=sharing): SC 316 + Monday, May 8th noon - 2pm: Groups 3, 4, 6, 7, 9, 11, 12, 14, 16, 19, 21 + Wednesday, May 10th 9 - 11am: Groups 1, 2, 5, 8, 10, 13, 15, 17, 18, 20 ************************ ### Week's Goals .pull-left[ **Mon Lecture** * Database querying with SQL ] .pull-right[ **Wed Lecture** * Modeling with `tidymodels` ] --- ## Goals Database querying with SQL * Why learn `SQL` when I already know `dplyr` and `readr`? + [Check out this post](https://blog.sqlizer.io/posts/sql-43/). -- Not Covering: * Database Administration + [MDSR Ch 16](https://mdsr-book.github.io/mdsr2e/ch-sql2.html) + Useful chapter to read for those who are more interested in the CS side of data science. --- ## Database Querying What is a **database query**? -- * A request for data from a set of tables in the database. Why learn database querying? -- + For many organizations, they store their data in a relational database. + Think Google, Facebook, LinkedIn, ... + To know how to handle data that are too big to store in memory. + Instead you want to store it in a database (on disk) and then query the database for the bits you want to load in memory. --- ## Memory Versus Hard Disk A computer has two main options for storing data: -- * **Memory (RAM)**: amount of data that a computer can work with at once. + Likely have a few gigabytes -- * **Hard Disk**: amount of data that a computer can store permanently. + Likely have hundreds or thousands of gigabytes -- Trade-off between storage room (disks) and speed (memory is faster to access). --- ## Data Storage and `R` * `R` generally stores all objects in memory. -- * That can get to be a problem as our datasets get bigger: ```r library(pdxTrees) pdxTrees <- get_pdxTrees_streets() dim(pdxTrees) ``` ``` ## [1] 218602 23 ``` ```r print(object.size(pdxTrees), units = "Mb") ``` ``` ## 55.6 Mb ``` --- ## Data Storage and `R` ```r print(object.size(pdxTrees), units = "Mb") ``` ``` ## 55.6 Mb ``` * Once we get to 10M rows in our data frame, things will start to get problematic. -- * **Solution**: Use a system that stores all of the data on disk but allows us to access a portion in memory. + Will use a relational database to do this! + There are other solutions for really BIG data. --- ## Relational Database * **Relational database**: collection of linkable **tables** that are linked together by **keys** * **Table**: Two dimensional + Record = row + Field = column --- ### Relational Database * We will explore an open im(db) database today with the [following schema](https://imgur.com/pDq0n) <img src="img/imdb_schema.png" width="80%" style="display: block; margin: auto;" /> --- ## [SQL](https://en.wikipedia.org/wiki/SQL) * **SQL**: Structured Query Language * Language for interacting with data stored in a relational database. * We will focus on **querying** the database. * Can be used to create and modify databases. --- ## RDBMS * **RDBMS**: relational database management system * There are several RDBMSs out there. * Unfortunately, the SQL syntax does vary a little between systems, but not really for the basics. * We will use [MySQL](https://www.mysql.com/). + Open source but owned by Oracle + Used by Facebook, Google, LinkedIn, Twitter --- ## MySQL * Client-server model + Server: Where data are stored and queries are executed. + Located on the user's computer or a remote server. + Client: Program that connects to the server * I did not set up a MySQL server. * Instead we will use the Smith College MySQL server, `scidb.smith.edu`. + Will access it through RStudio using `dplyr`, `DBI`, and `RMySQL`. * If you find yourself using a MySQL server, you should check out the [MySQLWorkbench](https://www.mysql.com/products/workbench/). --- ## Connecting to MySQL * `dbConnect()` via `DBI` package (which is automatically loaded when you load `RMySQL`). ```r library(RMySQL) db <- dbConnect(MySQL(), host = "scidb.smith.edu", user = "sds192", password = "DSismfc@S", dbname = "imdb") ``` --- ## Database Connection * We now have a database connection. ```r class(db) ``` ``` ## [1] "MySQLConnection" ## attr(,"package") ## [1] "RMySQL" ``` --- ## Tables in Database * The retrieval function: `dbGetQuery()` + First argument: database connection object. + Second argument: SQL query as a character vector. Ends in semi-colon. ```r db %>% dbGetQuery("SHOW TABLES;") ``` ``` ## Tables_in_imdb ## 1 aka_name ## 2 aka_title ## 3 cast_info ## 4 char_name ## 5 comp_cast_type ## 6 company_name ## 7 company_type ## 8 complete_cast ## 9 info_type ## 10 keyword ## 11 kind_type ## 12 link_type ## 13 movie_companies ## 14 movie_info ## 15 movie_info_idx ## 16 movie_keyword ## 17 movie_link ## 18 name ## 19 person_info ## 20 role_type ## 21 title ``` --- ## Another Option: Use SQL chunks * Will set some chunk options: ```r knitr::opts_chunk$set(connection = db, max.print = 20) ``` * Insert `sql` chunks instead of `r` chunks ````default ```{sql} SHOW TABLES; ``` ```` --- ## Retrieving data: `SELECT` and `FROM` * `SELECT` + Always starts a query * `FROM` + Specifies table ```r db %>% dbGetQuery("SELECT * FROM kind_type;") ``` ``` ## id kind ## 1 1 movie ## 2 2 tv series ## 3 3 tv movie ## 4 4 video movie ## 5 5 tv mini series ## 6 6 video game ## 7 7 episode ``` --- ## Retrieving data: `SELECT` and `FROM` * **What is the `dplyr` version?** What does `*` mean? ```r db %>% dbGetQuery("SELECT * FROM kind_type;") ``` ``` ## id kind ## 1 1 movie ## 2 2 tv series ## 3 3 tv movie ## 4 4 video movie ## 5 5 tv mini series ## 6 6 video game ## 7 7 episode ``` -- * Returns the kinds of "movies" in the database --- ## Saving Output ```r movie_kinds <- db %>% dbGetQuery("SELECT * FROM kind_type;") class(movie_kinds) ``` ``` ## [1] "data.frame" ``` --- ## A `SELECT` Word of Caution * I did NOT run the following chunk: ```r db %>% dbGetQuery("SELECT * FROM title;) ``` --- ## Retrieving data: `LIMIT` * **What is the `R` version?** ```r db %>% dbGetQuery("SELECT * FROM title LIMIT 10;") ``` ``` ## id title imdb_index kind_id ## 1 78460 Adults Recat to the Simpsons (30th Anniversary) <NA> 7 ## 2 70273 (2016-05-18) <NA> 7 ## 3 60105 (2014-04-11) <NA> 7 ## 4 32120 (2008-05-01) <NA> 7 ## 5 97554 Schmölders Traum <NA> 7 ## 6 57966 (#1.1) <NA> 7 ## 7 76391 Anniversary <NA> 7 ## 8 11952 Angus Black/Lester Barrie/DC Curry <NA> 7 ## 9 1554 New Orleans <NA> 7 ## 10 58442 Kiss Me Kate <NA> 7 ## production_year imdb_id phonetic_code episode_of_id season_nr episode_nr ## 1 2017 NA A3432 78406 NA NA ## 2 2016 NA <NA> 68058 NA NA ## 3 2014 NA <NA> 59138 NA NA ## 4 2008 NA <NA> 32060 NA NA ## 5 2001 NA S2543 97302 10 1 ## 6 2013 NA <NA> 57965 1 1 ## 7 1971 NA A5162 76385 4 9 ## 8 2009 NA A5214 11937 4 7 ## 9 2003 NA N6452 1508 2 11 ## 10 2011 NA K2523 58436 1 10 ## series_years md5sum ## 1 <NA> 2ae09eed7d576cc2c24774fed5b18168 ## 2 <NA> 511dfc14cfff7589d29a95abb30cd66a ## 3 <NA> c6cdce7e667e07713e431805c407feed ## 4 <NA> 100df65742caf5afd092b2e0ead67d8e ## 5 <NA> 46862a2f96f9fb2d59e8c9a11ecfdd28 ## 6 <NA> 409c37703766c4b24f8a86162fd9cf85 ## 7 <NA> 5e12ce73fac1d1dcf94136b6e9acd8f8 ## 8 <NA> 9c38b9e5601dc154444b73b518034aa1 ## 9 <NA> 621bea735740a547e862e4a3226f35d2 ## 10 <NA> 293e8c75c7f35a4035abf617962be5a9 ``` --- ## Retrieving data: `SELECT` and `FROM` * Why were these five movies selected? ```r db %>% dbGetQuery("SELECT title AS movie, production_year AS year FROM title LIMIT 11452, 5;") ``` ``` ## movie year ## 1 Fogar, dulce fogar 2007 ## 2 Tokyo Rose/Communism, Italian Style/Two Can Live as Cheaply as One 1976 ## 3 Insane Clown Posse 2011 ## 4 Agent Carter 2015 ## 5 (#2.32) 2012 ``` * What is the `dplyr` version of `AS`? -- * In SQL, this is called giving it an **alias**. + Only applied to the resulting data frame, not the data on the SQL server. --- ## `DISTINCT` * **What is the `dplyr`/R version?** ```r db %>% dbGetQuery("SELECT DISTINCT role FROM role_type;") ``` ``` ## role ## 1 actor ## 2 actress ## 3 cinematographer ## 4 composer ## 5 costume designer ## 6 director ## 7 editor ## 8 guest ## 9 miscellaneous crew ## 10 producer ## 11 production designer ## 12 writer ``` --- ## `COUNT` * **What is the `dplyr`/R version?** + Why are they not equal? ```r db %>% dbGetQuery("SELECT COUNT(title), COUNT(production_year) FROM title;") ``` ``` ## COUNT(title) COUNT(production_year) ## 1 4626322 4342074 ``` --- ## `COUNT` ```r db %>% dbGetQuery("SELECT COUNT(*) FROM title;") ``` ``` ## COUNT(*) ## 1 4626322 ``` --- ## `COUNT` + `DISTINCT` ```r db %>% dbGetQuery("SELECT COUNT(DISTINCT production_year) FROM title;") ``` ``` ## COUNT(DISTINCT production_year) ## 1 144 ``` --- ## `WHERE` * **What is the `dplyr` version?** * What is different about the comparison operator syntax? * Single quotes are important. ```r db %>% dbGetQuery("SELECT * FROM title WHERE title = 'Legally Blonde';") ``` ``` ## id title imdb_index kind_id production_year imdb_id ## 1 2036044 Legally Blonde <NA> 7 2004 NA ## 2 3870186 Legally Blonde <NA> 3 2003 NA ## 3 3870185 Legally Blonde <NA> 1 2001 NA ## phonetic_code episode_of_id season_nr episode_nr series_years ## 1 L2414 2036035 1 6 <NA> ## 2 L2414 NA NA NA <NA> ## 3 L2414 NA NA NA <NA> ## md5sum ## 1 1e239affb0a0bbbda334facc27ce7c9e ## 2 eb63b80688731bf69ed9c6edefae258f ## 3 db3fb1dfadaf49454ea7e1f1c2dbaa44 ``` --- ## `WHERE` * What are all of these Legally Blonde entries?! -- ```r db %>% dbGetQuery("SELECT * FROM kind_type;") ``` ``` ## id kind ## 1 1 movie ## 2 2 tv series ## 3 3 tv movie ## 4 4 video movie ## 5 5 tv mini series ## 6 6 video game ## 7 7 episode ``` --- ## `WHERE` * What is different about the comparison operator syntax? ```r db %>% dbGetQuery("SELECT * FROM title WHERE title = 'Legally Blonde' AND kind_id = 1;") ``` ``` ## id title imdb_index kind_id production_year imdb_id ## 1 3870185 Legally Blonde <NA> 1 2001 NA ## phonetic_code episode_of_id season_nr episode_nr series_years ## 1 L2414 NA NA NA <NA> ## md5sum ## 1 db3fb1dfadaf49454ea7e1f1c2dbaa44 ``` --- ## Comparison syntax * `=` equal * `<>` not equal * `<` less than * `<=` less than or equal to --- ## `WHERE` + `AND` ```r db %>% dbGetQuery("SELECT COUNT(*) FROM title WHERE production_year = 2016 AND kind_id = 1;") ``` ``` ## COUNT(*) ## 1 59702 ``` --- ## `WHERE` + `OR` ```r db %>% dbGetQuery("SELECT COUNT(*) FROM title WHERE (production_year = 2016 OR production_year = 2015) AND kind_id = 1;") ``` ``` ## COUNT(*) ## 1 116263 ``` --- Why aren't these the same? ```r db %>% dbGetQuery("SELECT COUNT(*) FROM title WHERE (production_year = 2016 OR production_year = 2015) AND kind_id = 1;") ``` ``` ## COUNT(*) ## 1 116263 ``` ```r db %>% dbGetQuery("SELECT COUNT(*) FROM title WHERE production_year = 2016 OR production_year = 2015 AND kind_id = 1;") ``` ``` ## COUNT(*) ## 1 318576 ``` --- ## `WHERE` + `BETWEEN` * `BETWEEN` is inclusive. ```r db %>% dbGetQuery("SELECT * FROM kind_type WHERE id BETWEEN 2 AND 4;") ``` ``` ## id kind ## 1 2 tv series ## 2 3 tv movie ## 3 4 video movie ``` --- ## `WHERE` + `IN` * **What is the `dplyr` version?** ```r db %>% dbGetQuery("SELECT * FROM kind_type WHERE id IN (2, 4, 7);") ``` ``` ## id kind ## 1 2 tv series ## 2 4 video movie ## 3 7 episode ``` --- ## `WHERE` + `IS NULL` ```r db %>% dbGetQuery("SELECT COUNT(*) FROM title WHERE production_year IS NULL;") ``` ``` ## COUNT(*) ## 1 284248 ``` --- ## Pattern Matching with `LIKE` and `NOT LIKE` ```r db %>% dbGetQuery("SELECT * FROM title WHERE title LIKE 'Legally%' LIMIT 0, 10;") ``` ``` ## id title imdb_index ## 1 3870184 Legally <NA> ## 2 242044 Legally Blind <NA> ## 3 486724 Legally Blonde 2 <NA> ## 4 1496011 Legally Blind <NA> ## 5 1496105 Legally Blonde the Musical: The Search for Elle Woods <NA> ## 6 1728481 Legally Blonde 2: Red, White & Blonde <NA> ## 7 2036044 Legally Blonde <NA> ## 8 2262079 Legally Blondes <NA> ## 9 2400129 Legally blonde 3 <NA> ## 10 3870186 Legally Blonde <NA> ## kind_id production_year imdb_id phonetic_code episode_of_id season_nr ## 1 1 NA NA L24 NA NA ## 2 7 NA NA L2414 242042 1 ## 3 7 2012 NA L2414 486643 4 ## 4 2 2017 NA L2414 NA NA ## 5 2 2008 NA L2414 NA NA ## 6 7 2003 NA L2414 1728469 1 ## 7 7 2004 NA L2414 2036035 1 ## 8 7 2013 NA L2414 2261985 5 ## 9 7 2011 NA L2414 2399989 1 ## 10 3 2003 NA L2414 NA NA ## episode_nr series_years md5sum ## 1 NA <NA> df1c7c45bf7fef52c387e10f506dc39b ## 2 3 <NA> 69ade384aa82ebd1d8661f4eae676454 ## 3 11 <NA> 5d2a60bc73688a19f0e72ed1bfdb4c34 ## 4 NA 2017-???? a3a60c4cb755af9b56085f961978bd85 ## 5 NA 2008-???? d9bd67c46b90a0b67db74d42521b31d3 ## 6 15 <NA> 69d91c76faafed212e3f6f4659d08b6d ## 7 6 <NA> 1e239affb0a0bbbda334facc27ce7c9e ## 8 2 <NA> 059bf3caa65c8868f3457135b8c68ff6 ## 9 6 <NA> b94e55876d7aead57800d96deb8c1006 ## 10 NA <NA> eb63b80688731bf69ed9c6edefae258f ``` --- ## Pattern Matching with `LIKE` and `NOT LIKE` ```r db %>% dbGetQuery("SELECT * FROM title WHERE title LIKE '%Legally' LIMIT 0, 10;") ``` ``` ## id title imdb_index kind_id ## 1 1218173 I Love You Illegally <NA> 2 ## 2 1342321 Veronica Convinces Miguel to Adopt Her Legally <NA> 7 ## 3 3870184 Legally <NA> 1 ## production_year imdb_id phonetic_code episode_of_id season_nr episode_nr ## 1 2012 NA L1424 NA NA NA ## 2 2012 NA V6525 1342206 1 117 ## 3 NA NA L24 NA NA NA ## series_years md5sum ## 1 2012-???? 3a6054e9f98c2a6664cd3734c23bdcc8 ## 2 <NA> 1b61542075b325fd60beffcf72b63087 ## 3 <NA> df1c7c45bf7fef52c387e10f506dc39b ``` --- ## Pattern Matching with `LIKE` and `NOT LIKE` ```r db %>% dbGetQuery("SELECT * FROM title WHERE title LIKE 'Legally%' LIMIT 0, 10;") ``` ``` ## id title imdb_index ## 1 3870184 Legally <NA> ## 2 242044 Legally Blind <NA> ## 3 486724 Legally Blonde 2 <NA> ## 4 1496011 Legally Blind <NA> ## 5 1496105 Legally Blonde the Musical: The Search for Elle Woods <NA> ## 6 1728481 Legally Blonde 2: Red, White & Blonde <NA> ## 7 2036044 Legally Blonde <NA> ## 8 2262079 Legally Blondes <NA> ## 9 2400129 Legally blonde 3 <NA> ## 10 3870186 Legally Blonde <NA> ## kind_id production_year imdb_id phonetic_code episode_of_id season_nr ## 1 1 NA NA L24 NA NA ## 2 7 NA NA L2414 242042 1 ## 3 7 2012 NA L2414 486643 4 ## 4 2 2017 NA L2414 NA NA ## 5 2 2008 NA L2414 NA NA ## 6 7 2003 NA L2414 1728469 1 ## 7 7 2004 NA L2414 2036035 1 ## 8 7 2013 NA L2414 2261985 5 ## 9 7 2011 NA L2414 2399989 1 ## 10 3 2003 NA L2414 NA NA ## episode_nr series_years md5sum ## 1 NA <NA> df1c7c45bf7fef52c387e10f506dc39b ## 2 3 <NA> 69ade384aa82ebd1d8661f4eae676454 ## 3 11 <NA> 5d2a60bc73688a19f0e72ed1bfdb4c34 ## 4 NA 2017-???? a3a60c4cb755af9b56085f961978bd85 ## 5 NA 2008-???? d9bd67c46b90a0b67db74d42521b31d3 ## 6 15 <NA> 69d91c76faafed212e3f6f4659d08b6d ## 7 6 <NA> 1e239affb0a0bbbda334facc27ce7c9e ## 8 2 <NA> 059bf3caa65c8868f3457135b8c68ff6 ## 9 6 <NA> b94e55876d7aead57800d96deb8c1006 ## 10 NA <NA> eb63b80688731bf69ed9c6edefae258f ``` --- ## Pattern Matching with `LIKE` and `NOT LIKE` ```r db %>% dbGetQuery("SELECT * FROM title WHERE title LIKE '%Legally%%' LIMIT 0, 10;") ``` ``` ## id ## 1 155004 ## 2 158063 ## 3 242044 ## 4 486724 ## 5 486675 ## 6 540039 ## 7 711841 ## 8 732752 ## 9 732715 ## 10 1188046 ## title ## 1 President Donald Trump Believes Millions Voted Illegally; Offers No Proof ## 2 Paeng Wants to Legally Adopt Mai-Mai ## 3 Legally Blind ## 4 Legally Blonde 2 ## 5 CMT #1 Music & a Movie: Legally Blonde ## 6 The Legally Dead Guy ## 7 Illegally High ## 8 I Want to Confront the Woman Whom I Believe Illegally Adopted My Granddaughter! ## 9 I Believe My Granddaughter Was Illegally Adopted. We Want Her Back! ## 10 Legally Yours ## imdb_index kind_id production_year imdb_id phonetic_code episode_of_id ## 1 <NA> 7 2017 NA P6235 153184 ## 2 <NA> 7 2012 NA P5253 157917 ## 3 <NA> 7 NA NA L2414 242042 ## 4 <NA> 7 2012 NA L2414 486643 ## 5 <NA> 7 2012 NA C5352 486643 ## 6 <NA> 7 2012 NA L2432 540033 ## 7 <NA> 7 2017 NA I4242 710440 ## 8 <NA> 7 2016 NA W5325 731686 ## 9 <NA> 7 2016 NA B4152 731686 ## 10 <NA> 7 2002 NA L2462 1188032 ## season_nr episode_nr series_years md5sum ## 1 14 99 <NA> 143f0c85802263addcb48350a59ff03f ## 2 2 27 <NA> d6baa394807d996df88a49785308f318 ## 3 1 3 <NA> 69ade384aa82ebd1d8661f4eae676454 ## 4 4 11 <NA> 5d2a60bc73688a19f0e72ed1bfdb4c34 ## 5 4 1 <NA> 46a4cded8da687563bb8f1a7a5693089 ## 6 2 3 <NA> a0923a77751c3c9010c1fd895ba60f87 ## 7 18 151 <NA> 74c6b210f055168e6bbc69c9b695e500 ## 8 14 100 <NA> 072808a6a9e939ab4e9ee84823e5c367 ## 9 14 99 <NA> e5c1dfb937658c91f86f2c52fee998bb ## 10 1 9 <NA> 8a842f6d0b78994069a824b277fa5b60 ``` --- ## Pattern Matching ```r db %>% dbGetQuery("SELECT * FROM title WHERE title REGEXP 'Legally Blonde' LIMIT 0, 10;") ``` ``` ## id ## 1 486724 ## 2 486675 ## 3 1295962 ## 4 1496105 ## 5 1728481 ## 6 2036044 ## 7 2296796 ## 8 2296863 ## 9 2262079 ## 10 2400129 ## title ## 1 Legally Blonde 2 ## 2 CMT #1 Music & a Movie: Legally Blonde ## 3 Illegally Blonde ## 4 Legally Blonde the Musical: The Search for Elle Woods ## 5 Legally Blonde 2: Red, White & Blonde ## 6 Legally Blonde ## 7 The League of Extraordinary Gentlemen/Pirates of the Caribbean: The Curse of the Black Pearl/Northfork/Legally Blonde 2: Red, White & Blonde ## 8 The Score/Legally Blonde/Bully ## 9 Legally Blondes ## 10 Legally blonde 3 ## imdb_index kind_id production_year imdb_id phonetic_code episode_of_id ## 1 <NA> 7 2012 NA L2414 486643 ## 2 <NA> 7 2012 NA C5352 486643 ## 3 <NA> 7 2015 NA I4241 1295902 ## 4 <NA> 2 2008 NA L2414 NA ## 5 <NA> 7 2003 NA L2414 1728469 ## 6 <NA> 7 2004 NA L2414 2036035 ## 7 <NA> 7 2003 NA L2123 2295842 ## 8 <NA> 7 2001 NA S2642 2295842 ## 9 <NA> 7 2013 NA L2414 2261985 ## 10 <NA> 7 2011 NA L2414 2399989 ## season_nr episode_nr series_years md5sum ## 1 4 11 <NA> 5d2a60bc73688a19f0e72ed1bfdb4c34 ## 2 4 1 <NA> 46a4cded8da687563bb8f1a7a5693089 ## 3 5 7 <NA> be11ca0f4b38ebc8bd3579d2bc37cadf ## 4 NA NA 2008-???? d9bd67c46b90a0b67db74d42521b31d3 ## 5 1 15 <NA> 69d91c76faafed212e3f6f4659d08b6d ## 6 1 6 <NA> 1e239affb0a0bbbda334facc27ce7c9e ## 7 18 6 <NA> 2e23b8ce4b9dc40d161e3fc5788ff54f ## 8 16 6 <NA> 4124e9d63662c1c8d5b886870e603168 ## 9 5 2 <NA> 059bf3caa65c8868f3457135b8c68ff6 ## 10 1 6 <NA> b94e55876d7aead57800d96deb8c1006 ``` * Can get much more sophisticated. + See examples [here](https://dev.mysql.com/doc/refman/5.7/en/pattern-matching.html). --- ## `ORDER BY` ```r db %>% dbGetQuery("SELECT * FROM title ORDER BY production_year DESC LIMIT 0, 10;") ``` ``` ## id title imdb_index kind_id ## 1 3204054 100 Years <NA> 1 ## 2 4448363 The Zero Century: Maetel <NA> 1 ## 3 3313453 Avatar 5 <NA> 1 ## 4 3313452 Avatar 4 <NA> 1 ## 5 3591414 Fantastic Beasts and Where to Find Them 5 <NA> 1 ## 6 3500140 Despicable Me 4 <NA> 1 ## 7 4339206 The Dark Room <NA> 1 ## 8 4540635 War in Black <NA> 1 ## 9 4504125 Under the Red Hood <NA> 1 ## 10 4448362 The Zero Century: Harlock <NA> 1 ## production_year imdb_id phonetic_code episode_of_id season_nr episode_nr ## 1 2115 NA Y62 NA NA NA ## 2 2026 NA Z6253 NA NA NA ## 3 2025 NA A136 NA NA NA ## 4 2024 NA A136 NA NA NA ## 5 2024 NA F5323 NA NA NA ## 6 2024 NA D2121 NA NA NA ## 7 2024 NA D6265 NA NA NA ## 8 2024 NA W6514 NA NA NA ## 9 2024 NA U5363 NA NA NA ## 10 2023 NA Z6253 NA NA NA ## series_years md5sum ## 1 <NA> 09bf2aeba5b3ad6f20a9189b24977160 ## 2 <NA> 0a6b3bb9ea4283dd675b9e80c9d746e5 ## 3 <NA> 6f162e78084f4eea4b41a60798a3b059 ## 4 <NA> 78f8c4c0db18d9ecc7bbe21ff280a7c5 ## 5 <NA> 4148d5a83722967b5bae325b7020f66b ## 6 <NA> 05d26ac8773334b1e94d210b5bcb7202 ## 7 <NA> e5ecb2acc42c86b9792bc4bc25187861 ## 8 <NA> 8a9c42dde74fd4cbbeecc87f0eba9b1a ## 9 <NA> 138f6db743aab2ad3d7e5c4835777bee ## 10 <NA> 106e4daeb7a8d2bc4d98bd0cd5808ff0 ``` --- ## `JOIN`S Part of SQL's success is the ability to efficiently join tables. What information do we need to join two tables? -- * The name of the first table. * The type of join you want. * The name of the second table. * The conditions under which you want the records in table 1 to match records in table 2. -- Let's look at some examples with the im(db) database. * We want to find the keywords associated with "Legally Blonde". --- ## Schema Again As we go through these examples, I want you to ask yourself: * Why are there so many tables? * Why isn't there one movies table? <img src="img/imdb_schema.png" width="70%" style="display: block; margin: auto;" /> --- ## Keywords * Seems like we need to start with these two tables: `movie_keyword` and `keyword`. * What does `COUNT(*)` do again? ```r db %>% dbGetQuery("SELECT COUNT(*) FROM movie_keyword;") ``` ``` ## COUNT(*) ## 1 7342172 ``` ```r db %>% dbGetQuery("SELECT COUNT(*) FROM keyword;") ``` ``` ## COUNT(*) ## 1 230407 ``` --- ```r db %>% dbGetQuery("SELECT * FROM movie_keyword LIMIT 0, 8;") ``` ``` ## id movie_id keyword_id ## 1 1 2 1 ## 2 2 35 2 ## 3 3 55 3 ## 4 4 55 4 ## 5 5 55 5 ## 6 6 57 6 ## 7 7 102 7 ## 8 8 102 8 ``` ```r db %>% dbGetQuery("SELECT * FROM keyword LIMIT 0, 8;") ``` ``` ## id keyword phonetic_code ## 1 4321 raining R52 ## 2 6618 handcuffed-to-a-bed H5321 ## 3 791 offer-refused O1612 ## 4 8228 reference-to-dagmar-havlova R1652 ## 5 352 jane-austen J5235 ## 6 31 woods W32 ## 7 5491 teenage-son T525 ## 8 5830 hanging H5252 ``` --- ## Keywords What is the SQL command to arrange the `keyword` dataset by `id`? ```r db %>% dbGetQuery("SELECT * FROM keyword LIMIT 0, 8;") ``` ``` ## id keyword phonetic_code ## 1 4321 raining R52 ## 2 6618 handcuffed-to-a-bed H5321 ## 3 791 offer-refused O1612 ## 4 8228 reference-to-dagmar-havlova R1652 ## 5 352 jane-austen J5235 ## 6 31 woods W32 ## 7 5491 teenage-son T525 ## 8 5830 hanging H5252 ``` --- ## Keywords What is the SQL command to arrange the `keyword` dataset by `id`? ```r db %>% dbGetQuery("SELECT * FROM keyword ORDER BY id LIMIT 0, 8;") ``` ``` ## id keyword phonetic_code ## 1 1 number-in-title N5165 ## 2 2 web-series W1262 ## 3 3 friend F653 ## 4 4 heroin H65 ## 5 5 vlog V42 ## 6 6 tv-mini-series T1526 ## 7 7 austin-texas A2353 ## 8 8 beer B6 ``` --- ## Key for Keywords? What is the key for linking `keyword` and `movie_keyword`? <img src="img/imdb_schema.png" width="100%" style="display: block; margin: auto;" /> --- ## `JOIN` * Notice: `movie_keyword.movie_id` + R syntax? ```r db %>% dbGetQuery("SELECT movie_keyword.movie_id, keyword.keyword FROM movie_keyword JOIN keyword ON movie_keyword.keyword_id = keyword.id LIMIT 0, 8;") ``` ``` ## movie_id keyword ## 1 2 number-in-title ## 2 35 web-series ## 3 55 friend ## 4 55 heroin ## 5 55 vlog ## 6 57 tv-mini-series ## 7 102 austin-texas ## 8 102 beer ``` --- ## `JOIN` * Can remove the dataset name if... ```r db %>% dbGetQuery("SELECT movie_id, keyword FROM movie_keyword JOIN keyword ON keyword_id = id LIMIT 0, 8;") ``` ``` ## Error in .local(conn, statement, ...): could not run statement: Column 'id' in on clause is ambiguous ``` --- ## `JOIN` * Can remove the dataset name if... ```r db %>% dbGetQuery("SELECT movie_id, keyword FROM movie_keyword JOIN keyword ON keyword_id = keyword.id LIMIT 0, 8;") ``` ``` ## movie_id keyword ## 1 2 number-in-title ## 2 35 web-series ## 3 55 friend ## 4 55 heroin ## 5 55 vlog ## 6 57 tv-mini-series ## 7 102 austin-texas ## 8 102 beer ``` --- ## `JOIN` ```r db %>% dbGetQuery("SELECT COUNT(*) FROM movie_keyword JOIN keyword ON keyword_id = keyword.id;") ``` ``` ## COUNT(*) ## 1 7342172 ``` --- ## Table Aliases ```r db %>% dbGetQuery("SELECT mk.movie_id, k.keyword FROM movie_keyword mk JOIN keyword k ON mk.keyword_id = k.id LIMIT 0, 8;") ``` ``` ## movie_id keyword ## 1 2 number-in-title ## 2 35 web-series ## 3 55 friend ## 4 55 heroin ## 5 55 vlog ## 6 57 tv-mini-series ## 7 102 austin-texas ## 8 102 beer ``` * Now what table to do we want to `JOIN` to our current result? --- ## Keywords <img src="img/imdb_schema.png" width="100%" style="display: block; margin: auto;" /> --- ```r db %>% dbGetQuery("SELECT t.title, k.keyword FROM movie_keyword mk JOIN keyword k ON mk.keyword_id = k.id JOIN title t ON mk.movie_id = t.id LIMIT 0, 8;") ``` ``` ## title keyword ## 1 #1 Single number-in-title ## 2 #1MinuteNightmare web-series ## 3 #30Nods friend ## 4 #30Nods heroin ## 5 #30Nods vlog ## 6 #4Hire tv-mini-series ## 7 #ATown austin-texas ## 8 #ATown beer ``` --- Now we are close to finding the keywords in 'Legally Blonde'. * If we want to filter to just 'Legally Blonde', what SQL command do we need? ```r db %>% dbGetQuery("SELECT t.title, k.keyword FROM movie_keyword mk JOIN keyword k ON mk.keyword_id = k.id JOIN title t ON mk.movie_id = t.id LIMIT 0, 8;") ``` ``` ## title keyword ## 1 #1 Single number-in-title ## 2 #1MinuteNightmare web-series ## 3 #30Nods friend ## 4 #30Nods heroin ## 5 #30Nods vlog ## 6 #4Hire tv-mini-series ## 7 #ATown austin-texas ## 8 #ATown beer ``` --- Now we are close to finding the keywords in 'Legally Blonde'. * If we want to filter to just 'Legally Blonde', what SQL command do we need to add? ```r db %>% dbGetQuery("SELECT t.title, k.keyword FROM movie_keyword mk JOIN keyword k ON mk.keyword_id = k.id JOIN title t ON mk.movie_id = t.id WHERE t.title = 'Legally Blonde' AND t.kind_id = 1 LIMIT 0, 8;") ``` ``` ## title keyword ## 1 Legally Blonde 2000s ## 2 Legally Blonde aerobics ## 3 Legally Blonde based-on-novel ## 4 Legally Blonde beauty-salon ## 5 Legally Blonde bikini ## 6 Legally Blonde black-female-judge ## 7 Legally Blonde blonde ## 8 Legally Blonde blonde-girl ``` --- ```r db %>% dbGetQuery("SELECT t.title, k.keyword FROM movie_keyword mk JOIN keyword k ON mk.keyword_id = k.id JOIN title t ON mk.movie_id = t.id WHERE t.title = 'Legally Blonde' AND t.kind_id = 1 LIMIT 0, 300;") ``` ``` ## title keyword ## 1 Legally Blonde 2000s ## 2 Legally Blonde aerobics ## 3 Legally Blonde based-on-novel ## 4 Legally Blonde beauty-salon ## 5 Legally Blonde bikini ## 6 Legally Blonde black-female-judge ## 7 Legally Blonde blonde ## 8 Legally Blonde blonde-girl ## 9 Legally Blonde blonde-hair ## 10 Legally Blonde blonde-stereotype ## 11 Legally Blonde blonde-woman ## 12 Legally Blonde boston-massachusetts ## 13 Legally Blonde boy ## 14 Legally Blonde broken-engagement ## 15 Legally Blonde california ## 16 Legally Blonde chick-flick ## 17 Legally Blonde chihuahua ## 18 Legally Blonde client ## 19 Legally Blonde college ## 20 Legally Blonde college-student ## 21 Legally Blonde color-pink ## 22 Legally Blonde courtroom ## 23 Legally Blonde cultural-conflict ## 24 Legally Blonde delivery-man ## 25 Legally Blonde dog ## 26 Legally Blonde enlightenment ## 27 Legally Blonde ex-boyfriend-ex-girlfriend-relationship ## 28 Legally Blonde fashion ## 29 Legally Blonde female-protagonist ## 30 Legally Blonde first-part ## 31 Legally Blonde fish-out-of-water ## 32 Legally Blonde fraternity ## 33 Legally Blonde girl ## 34 Legally Blonde girl-power ## 35 Legally Blonde harvard ## 36 Legally Blonde harvard-law-school ## 37 Legally Blonde intern ## 38 Legally Blonde law ## 39 Legally Blonde law-school ## 40 Legally Blonde law-student ## 41 Legally Blonde lawyer ## 42 Legally Blonde los-angeles-california ## 43 Legally Blonde manicurist ## 44 Legally Blonde mistaken-motive ## 45 Legally Blonde murder ## 46 Legally Blonde pink ## 47 Legally Blonde prison ## 48 Legally Blonde product-placement ## 49 Legally Blonde professor ## 50 Legally Blonde reference-to-britney-spears ## 51 Legally Blonde reference-to-ricky-martin ## 52 Legally Blonde roommate ## 53 Legally Blonde sexual-harassment ## 54 Legally Blonde shower ## 55 Legally Blonde smart-blonde ## 56 Legally Blonde sorority ## 57 Legally Blonde spa ## 58 Legally Blonde teen-comedy ## 59 Legally Blonde teen-movie ## 60 Legally Blonde trial ## 61 Legally Blonde trust ## 62 Legally Blonde two-word-title ## 63 Legally Blonde vegetarian ## 64 Legally Blonde what-happened-to-epilogue ## 65 Legally Blonde wolf-whistle ## 66 Legally Blonde woman ``` --- ## Just For Fun ```r db %>% dbGetQuery("SELECT title FROM movie_keyword mk JOIN keyword k ON mk.keyword_id = k.id JOIN title t ON mk.movie_id = t.id WHERE k.keyword IN ('harvard-university', 'harvard');") ``` ``` ## title ## 1 Conspiracy Theories ## 2 Fringe ## 3 Pilot ## 4 It's All Relative ## 5 Overnight Trip ## 6 (#1.19) ## 7 The End of the Berkeley Blockade ## 8 The First Olympics: Athens 1896 ## 9 Pussycat Doll vs. Pussy-Hat Dolls ## 10 Profile in Silver/Button, Button ## 11 (2017-10-18) ## 12 'Northwest Passage' (Book I -- Rogers' Rangers) ## 13 21 ## 14 A Great Education ## 15 A Small Act ## 16 Accepted ## 17 Admissions ## 18 American Psycho ## 19 Angels & Demons ## 20 Bamboozled ## 21 Bobby Jones: Stroke of Genius ## 22 Boston Terrier ## 23 Brad's Status ## 24 Change at 125th Street ## 25 Dead Poets Society ## 26 El secreto de sus ojos ## 27 First Affair ## 28 Flight ## 29 Giant ## 30 Good Will Hunting ## 31 H.M. Pulham, Esq. ## 32 Happy Birthday, Gemini ## 33 Harvard Beats Yale 29-29 ## 34 Haven ## 35 Heaven's Gate ## 36 How High ## 37 Huddle ## 38 Jimmy Carter Man from Plains ## 39 Joe Gould's Secret ## 40 L'héritier ## 41 Lady Barnacle ## 42 Legally Blonde 2: Red, White & Blonde ## 43 Live at Harvard: Dave Rubin, Bret Weinstein, Steve Simpson on Free Speech in America ## 44 Margot at the Wedding ## 45 Mona Lisa Smile ## 46 Monumental: In Search of America's National Treasure ## 47 Mr. & Mrs. Bridge ## 48 My Teacher's Wife ## 49 Mystery Street ## 50 No Look Pass ## 51 Personal Velocity: Three Portraits ## 52 Pigskin Parade ## 53 Prozac Nation ## 54 Ruthless ## 55 School Ties ## 56 Seeing Other People ## 57 Shakespeare's Sonnets ## 58 Soul Man ## 59 Southside with You ## 60 Swimming Upstream ## 61 Temple Grandin ## 62 The Autumn Heart ## 63 The Da Vinci Code ## 64 The Departed ## 65 The Devil's Hour ## 66 The Firm ## 67 The Fog of War: Eleven Lessons from the Life of Robert S. McNamara ## 68 The Great Debaters ## 69 The Nanny Diaries ## 70 The Nuer ## 71 The Paper Chase ## 72 The Price of Pleasure: Pornography, Sexuality & Relationships ## 73 The Skulls ## 74 The Social Network ## 75 The Supreme Sacrifice ## 76 The Unbelievable Truth ## 77 The Wrong Guy ## 78 The Young Rajah ## 79 Theodore Roosevelt: A Cowboy's Ride to the White House ## 80 To Race the Wind ## 81 Two of a Kind ## 82 Waking the Dead ## 83 Rejection & Rocketships ## 84 Eleanor and Franklin ## 85 Harvard Court ## 86 The Adventures of Young Hillary ## 87 The Paper Chase ## 88 'Northwest Passage' (Book I -- Rogers' Rangers) ## 89 21 ## 90 A Beautiful Mind ## 91 A Small Circle of Friends ## 92 Altered States ## 93 Beat the House ## 94 Between Two Friends ## 95 Blue Blood and Red ## 96 Both Ways ## 97 Brown of Harvard ## 98 Brown of Harvard ## 99 Compared to What: The Improbable Journey of Barney Frank ## 100 Corso: The Last Beat ## 101 Donnie Darko ## 102 Freshman Father ## 103 God's Country ## 104 Good Will Hunting ## 105 HairBrained ## 106 Happy Birthday, Gemini ## 107 Harvard Man ## 108 Hold 'Em Yale ## 109 How High ## 110 Ice Princess ## 111 If You Could See What I Hear ## 112 Inside 'Legally Blonde' ## 113 John Harvard ## 114 Just Cause ## 115 Legally Blonde ## 116 Love Story ## 117 MysteryDisc: Many Roads to Murder ## 118 PERKINS 28: Testimony from the Secret Court Files of 1920 ## 119 Path to War ## 120 Selig-Tribune, No. 21 ## 121 Son of Paleface ## 122 Southside with You ## 123 Spartan ## 124 Spring Madness ## 125 Stealing Harvard ## 126 Stranger Than Fiction, #69 ## 127 Swimming Upstream ## 128 Ted ## 129 The Da Vinci Code ## 130 The Firm ## 131 The Gumshoe Kid ## 132 The Ivy League Farmer ## 133 The Lady Eve ## 134 The Paper Chase ## 135 The Pilot Season Survival Guide ## 136 The Skin I'm In ## 137 The Social Network ## 138 Theodore Roosevelt: A Cowboy's Ride to the White House ## 139 With Honors ``` --- ## Why are there so many datasets?! Why not just store all of the information about the movies in ONE dataset? --- What does a row represent for `title`? For `keyword`? For `movie_keyword`? ```r db %>% dbGetQuery("SELECT title FROM title LIMIT 4372899, 3;") ``` ``` ## title ## 1 The Master ## 2 The Heist ## 3 The Four Wizards at Deathly Side ``` ```r db %>% dbGetQuery("SELECT * FROM keyword LIMIT 0, 3;") ``` ``` ## id keyword phonetic_code ## 1 4321 raining R52 ## 2 6618 handcuffed-to-a-bed H5321 ## 3 791 offer-refused O1612 ``` ```r db %>% dbGetQuery("SELECT * FROM movie_keyword LIMIT 0, 3;") ``` ``` ## id movie_id keyword_id ## 1 1 2 1 ## 2 2 35 2 ## 3 3 55 3 ``` --- ## Why are there so many datasets?! Why not just store all of the information about the movies in ONE dataset? -- + Not very efficient. -- Database solution: + Store info about the movies in `title`. + Store info about the keywords in `keyword`. + Store short codes for the `keywords` in `keyword_id`. + Only `JOIN` when we need to. --- ## Last Step Don't forget to store your dataset in R! ```r lb_keywords <- db %>% dbGetQuery("SELECT t.title, k.keyword FROM movie_keyword mk JOIN keyword k ON mk.keyword_id = k.id JOIN title t ON mk.movie_id = t.id WHERE t.title = 'Legally Blonde' AND t.kind_id = 1;") ``` --- ## New Example * Want to see how to use the aggregate functions but the `imdb` database doesn't have a lot of quantitative data. ```r library(tidyverse) library(RMySQL) db <- dbConnect(MySQL(), host = "scidb.smith.edu", user = "waiuser", password = "smith_waiDB", dbname = "wai") ``` --- ## New Example * A database of wideband acoustic immitance variables from humans with normal hearing. + [More info here](http://www.science.smith.edu/wai-database/home/download-data/) ```r dbGetQuery(db, "SHOW TABLES;") ``` ``` ## Tables_in_wai ## 1 Codebook ## 2 Measurements ## 3 Measurements_pre2020 ## 4 Merchant2010_Measurements ## 5 PI_Info ## 6 PI_Info_OLD ## 7 Subjects ## 8 Subjects_pre2020 ## 9 WernerAdults2010_Measurements ``` ```r dbGetQuery(db, "EXPLAIN PI_Info;") ``` ``` ## Field Type Null Key Default Extra ## 1 Identifier varchar(50) NO PRI <NA> ## 2 Year int NO <NA> ## 3 Authors text NO <NA> ## 4 AuthorsShortList text NO <NA> ## 5 Title text NO <NA> ## 6 Journal text NO <NA> ## 7 URL text NO <NA> ## 8 Abstract text NO <NA> ## 9 DataSubmitterName text NO <NA> ## 10 DataSubmitterEmail text NO <NA> ## 11 DateSubmitted text NO <NA> ## 12 PI_Notes text NO <NA> ``` --- ## New Example ```r dbGetQuery(db, "SELECT * FROM Measurements LIMIT 10;") ``` ``` ## Identifier SubjectNumber Session Ear Instrument Age AgeCategory EarStatus ## 1 Abur_2014 1 1 Left HearID 20 Adult Normal ## 2 Abur_2014 1 1 Left HearID 20 Adult Normal ## 3 Abur_2014 1 1 Left HearID 20 Adult Normal ## 4 Abur_2014 1 1 Left HearID 20 Adult Normal ## 5 Abur_2014 1 1 Left HearID 20 Adult Normal ## 6 Abur_2014 1 1 Left HearID 20 Adult Normal ## 7 Abur_2014 1 1 Left HearID 20 Adult Normal ## 8 Abur_2014 1 1 Left HearID 20 Adult Normal ## 9 Abur_2014 1 1 Left HearID 20 Adult Normal ## 10 Abur_2014 1 1 Left HearID 20 Adult Normal ## TPP AreaCanal PressureCanal SweepDirection Frequency Absorbance Zmag ## 1 -5 4.42e-05 0 Ambient 210.938 0.0333379 113780000 ## 2 -5 4.42e-05 0 Ambient 234.375 0.0315705 103585000 ## 3 -5 4.42e-05 0 Ambient 257.812 0.0405751 92951700 ## 4 -5 4.42e-05 0 Ambient 281.250 0.0438399 86058000 ## 5 -5 4.42e-05 0 Ambient 304.688 0.0486400 79492800 ## 6 -5 4.42e-05 0 Ambient 328.125 0.0527801 73326200 ## 7 -5 4.42e-05 0 Ambient 351.562 0.0583192 68793600 ## 8 -5 4.42e-05 0 Ambient 375.000 0.0638881 64088600 ## 9 -5 4.42e-05 0 Ambient 398.438 0.0687025 60200600 ## 10 -5 4.42e-05 0 Ambient 421.875 0.0833181 56990900 ## Zang ## 1 -0.233504 ## 2 -0.235778 ## 3 -0.233482 ## 4 -0.233421 ## 5 -0.232931 ## 6 -0.232837 ## 7 -0.232115 ## 8 -0.231642 ## 9 -0.231356 ## 10 -0.228356 ``` --- ## Aggregate Functions * We have already seen `COUNT` * What does this code seem to do? ```r db %>% dbGetQuery("SELECT Identifier, COUNT(*) AS n FROM Measurements GROUP BY Identifier ORDER BY n DESC;") ``` ``` ## Identifier n ## 1 Sanford_2009 2788620 ## 2 Sun_2023 598130 ## 3 Keefe_2012 477360 ## 4 Lewis_2018 245820 ## 5 Myers_2018 116202 ## 6 Lewis_2015 114716 ## 7 Merchant_2021 61152 ## 8 Shahnaz_2006 58776 ## 9 Hunter_2016 44726 ## 10 Groon_2015 35469 ## 11 Keefe_2003 30426 ## 12 Downing_2022 29568 ## 13 Werner_2010 24360 ## 14 Feeney_2017 22134 ## 15 Abur_2014 21328 ## 16 Voss_2016 19344 ## 17 Pitaro_2016 17340 ## 18 Voss_2010 14880 ## 19 Rosowski_2012 14384 ## 20 Merchant_2020 13468 ## 21 Merchant_2015 9920 ## 22 Nakajima_2012 7688 ## 23 Merchant_2010 7688 ## 24 Ellison_2012 6720 ## 25 Liu_2008 5520 ## 26 Aithal_2017a 5200 ## 27 Voss_1994 5120 ## 28 Aithal_2014b 4848 ## 29 Aithal_2019a 4640 ## 30 Aithal_2019b 3872 ## 31 Keefe_2017 3696 ## 32 Aithal_2020b 3216 ## 33 Aithal_2015 3056 ## 34 Shaver_2013 2880 ## 35 Sun_2016 2604 ## 36 Aithal_2014 2368 ## 37 Sanford_2014 1736 ## 38 Aithal_2020a 1216 ## 39 Aithal_2013 1056 ``` --- ## Aggregate Functions * **What is the `dplyr`/R version?** ```r db %>% dbGetQuery("SELECT AVG(Frequency) AS avgFreq, MAX(Frequency) AS maxFreq, SUM(Absorbance) AS totalAbsorbance FROM Measurements;") ``` ``` ## avgFreq maxFreq totalAbsorbance ## 1 3144.861 24000 2021625 ``` --- ## Aggregate Functions ```r db %>% dbGetQuery("SELECT AVG(Frequency) FROM Measurements WHERE Ear = 'Left';") ``` ``` ## AVG(Frequency) ## 1 2591.847 ``` --- ## Aggregate Functions * **What is the `dplyr`/R version?** ```r db %>% dbGetQuery("SELECT AVG(Frequency), Ear FROM Measurements GROUP BY Ear;") ``` ``` ## AVG(Frequency) Ear ## 1 2591.847 Left ## 2 3629.727 Right ## 3 2318.750 Righ ## 4 5050.783 Unknown ``` * But the aggregate functions are limited. --- ## One More SQL Function * Need to fix this error ```r db %>% dbGetQuery("SELECT AVG(Frequency) AS avgfreq FROM Measurements WHERE avgfreq > 6000 GROUP BY IDENTIFIER;") ``` ``` ## Error in .local(conn, statement, ...): could not run statement: Unknown column 'avgfreq' in 'where clause' ``` --- ## One More SQL Function: `HAVING` ```r db %>% dbGetQuery("SELECT AVG(Frequency) AS avgfreq FROM Measurements GROUP BY IDENTIFIER HAVING avgfreq > 6000") ``` ``` ## avgfreq ## 1 12000.00 ## 2 12000.00 ## 3 10019.53 ``` --- ## Query Order in SQL: Order matters! * `SELECT` * `FROM` * `JOIN` * `WHERE` * `GROUP BY` * `HAVING` * `ORDER BY` * `LIMIT` --- ## `R` Versus `SQL` * `SQL` Positives: + Great at accessing big(ger) data (and smaller data) + Very efficient at joining data, filtering data, selecting columns... * `SQL` Negatives: + Summary statistics and data wrangling are very limited in scope + Can't plot, fit models, run hypothesis tests, ... --- ## Interacting with the `MySQL` server via `dplyr` ```r db <- dbConnect(MySQL(), host = "scidb.smith.edu", user = "sds192", password = "DSismfc@S", dbname = "imdb") title <- tbl(db, "title") keyword <- tbl(db, "keyword") movie_keyword <- tbl(db, "movie_keyword") class(title) ``` ``` ## [1] "tbl_MySQLConnection" "tbl_dbi" "tbl_sql" ## [4] "tbl_lazy" "tbl" ``` --- ```r dat <- inner_join(keyword, movie_keyword, by = c("id" = "keyword_id")) %>% inner_join(title, by = c("movie_id" = "id")) %>% filter(title == "Legally Blonde", kind_id == 1) %>% select(title, keyword) dat ``` ``` ## # Source: SQL [?? x 2] ## # Database: mysql 8.0.32-0ubuntu0.20.04.2 [@scidb.smith.edu:/imdb] ## title keyword ## <chr> <chr> ## 1 Legally Blonde 2000s ## 2 Legally Blonde aerobics ## 3 Legally Blonde based-on-novel ## 4 Legally Blonde beauty-salon ## 5 Legally Blonde bikini ## 6 Legally Blonde black-female-judge ## 7 Legally Blonde blonde ## 8 Legally Blonde blonde-girl ## 9 Legally Blonde blonde-hair ## 10 Legally Blonde blonde-stereotype ## # ℹ more rows ``` --- ## Translation of `dplyr` to `SQL` ```r show_query(dat) ``` ``` ## <SQL> ## SELECT `title`, `keyword` ## FROM ( ## SELECT ## `keyword`.`id` AS `id`, ## `keyword`, ## `keyword`.`phonetic_code` AS `phonetic_code.x`, ## `movie_keyword`.`id` AS `id.y`, ## `movie_id`, ## `title`, ## `imdb_index`, ## `kind_id`, ## `production_year`, ## `imdb_id`, ## `title`.`phonetic_code` AS `phonetic_code.y`, ## `episode_of_id`, ## `season_nr`, ## `episode_nr`, ## `series_years`, ## `md5sum` ## FROM `keyword` ## INNER JOIN `movie_keyword` ## ON (`keyword`.`id` = `movie_keyword`.`keyword_id`) ## INNER JOIN `title` ## ON (`movie_keyword`.`movie_id` = `title`.`id`) ## ) `q01` ## WHERE (`title` = 'Legally Blonde') AND (`kind_id` = 1.0) ```