Setup
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.3.3
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'ggplot2' was built under R version 3.3.2
## Warning: package 'tibble' was built under R version 3.3.3
## Warning: package 'tidyr' was built under R version 3.3.3
## Warning: package 'purrr' was built under R version 3.3.3
## Warning: package 'dplyr' was built under R version 3.3.3
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(sqldf)
## Warning: package 'sqldf' was built under R version 3.3.2
## Loading required package: gsubfn
## Warning: package 'gsubfn' was built under R version 3.3.2
## Loading required package: proto
## Warning: package 'proto' was built under R version 3.3.2
## Loading required package: RSQLite
## Loading required package: DBI
## Warning: package 'DBI' was built under R version 3.3.3
## Loading required package: methods
db <- tribble(
~id, ~salary,
1, 100,
2, 200,
3, 300
)
Question
- What is subquery in SQL?
- What is scalar in SQL?
Things to remember
WHERE clause
=
: Equal<>
: Not equal. Note: In some versions of SQL this operator may be written as !=>
: Greater than<
: Less than>=
: Greater than or equal<=
: Less than or equalBETWEEN
: Between an inclusive rangeLIKE
: Search for a patternIN
: To specify multiple possible values for a column* `` :
Subquery
- When you want to make two queries and combine the results, use subquery.
- SQL Joins are more efficient than subqueries(subquery seems more intuitive to many user)
- Subquery normally return an scaler value as result or result from one column if used along with IN Clause.
- In correlated suqery setting, outer query gets processed before inner query
Subquery rules
- subquery can be used in insert statement.
- subquery can be used in select statement as column.
- subquery should always return either a scaler value if used with where clause or value from a column if used with IN or NOT IN clause.
Two types of subqueries
- non-correlated: when inner query doesn’t depend on outer query and can run as stand alone query. Typically, it occurs in
IN
orNOT IN
SQL clause. - correlated : correlated subqueries are slower queries and one should avoid it as much as possible. Use JOIN instead if you can.
library(tidyverse)
db_stock <- tribble(
~ric, ~company, ~listed_on_exchange,
"6785.T", "Sony", "T",
"GOOG.O", "Google Inc", "O",
"GS.N", "Goldman Sachs Group Inc", "N",
"INDIGO", "INDIGO Airlines", NA,
"INFY.BO", "InfoSys", "BO",
"VOD.L", "Vodafone Group PLC", "L"
)
db_market <- tribble(
~ric, ~name, ~country,
"T", "Tokyo Stock Exchange", "Japan",
"O", "NASDAQ", "United States",
"N", "New York Stock Exchange", "United States",
"BO", "Bombay Stock Exchange", "India"
)
#Find all stocks from Japan
sqldf("SELECT company from db_stock WHERE listed_on_exchange = (SELECT ric FROM db_market WHERE country='Japan')")
## Loading required package: tcltk
## company
## 1 Sony
#Find all stocks from US and India
sqldf("SELECT company from db_stock WHERE listed_on_exchange IN (SELECT ric FROM db_market WHERE country='United States' OR country='India')")
## company
## 1 Google Inc
## 2 Goldman Sachs Group Inc
## 3 InfoSys
# Return all markets which has at least one stock listed on it
db_market
## # A tibble: 4 x 3
## ric name country
## <chr> <chr> <chr>
## 1 T Tokyo Stock Exchange Japan
## 2 O NASDAQ United States
## 3 N New York Stock Exchange United States
## 4 BO Bombay Stock Exchange India
db_stock
## # A tibble: 6 x 3
## ric company listed_on_exchange
## <chr> <chr> <chr>
## 1 6785.T Sony T
## 2 GOOG.O Google Inc O
## 3 GS.N Goldman Sachs Group Inc N
## 4 INDIGO INDIGO Airlines <NA>
## 5 INFY.BO InfoSys BO
## 6 VOD.L Vodafone Group PLC L
sqldf("SELECT m.name FROM db_market m WHERE m.ric = (SELECT s.listed_on_exchange FROM db_stock s WHERE s.listed_on_exchange = m.ric)")
## name
## 1 Tokyo Stock Exchange
## 2 NASDAQ
## 3 New York Stock Exchange
## 4 Bombay Stock Exchange
Interview Questions
# Question: Identify second highest salary
# Answer - version 1 (non-correlated )
sqldf("SELECT max(salary) FROM db WHERE Salary NOT IN (SELECT max(salary) FROM db)")
## max(salary)
## 1 200
# Answer - version 2 (correlated subquery)
sqldf("SELECT id, salary FROM db a WHERE 2=(SELECT COUNT(DISTINCT salary) FROM db b WHERE a.salary <=b.salary)")
## id salary
## 1 2 200