Setup

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.3.3
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'ggplot2' was built under R version 3.3.2
## Warning: package 'tibble' was built under R version 3.3.3
## Warning: package 'tidyr' was built under R version 3.3.3
## Warning: package 'purrr' was built under R version 3.3.3
## Warning: package 'dplyr' was built under R version 3.3.3
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(sqldf)
## Warning: package 'sqldf' was built under R version 3.3.2
## Loading required package: gsubfn
## Warning: package 'gsubfn' was built under R version 3.3.2
## Loading required package: proto
## Warning: package 'proto' was built under R version 3.3.2
## Loading required package: RSQLite
## Loading required package: DBI
## Warning: package 'DBI' was built under R version 3.3.3
## Loading required package: methods
db <- tribble(
  ~id, ~salary,
  1, 100,
  2, 200,
  3, 300
)

Question

  • What is subquery in SQL?
  • What is scalar in SQL?

Things to remember

WHERE clause

  • = : Equal
  • <> : Not equal. Note: In some versions of SQL this operator may be written as !=
  • > : Greater than
  • < : Less than
  • >= : Greater than or equal
  • <= : Less than or equal
  • BETWEEN : Between an inclusive range
  • LIKE : Search for a pattern
  • IN : To specify multiple possible values for a column* `` :

Subquery

  • When you want to make two queries and combine the results, use subquery.
  • SQL Joins are more efficient than subqueries(subquery seems more intuitive to many user)
  • Subquery normally return an scaler value as result or result from one column if used along with IN Clause.
  • In correlated suqery setting, outer query gets processed before inner query

Subquery rules

  1. subquery can be used in insert statement.
  2. subquery can be used in select statement as column.
  3. subquery should always return either a scaler value if used with where clause or value from a column if used with IN or NOT IN clause.

Two types of subqueries

  • non-correlated: when inner query doesn’t depend on outer query and can run as stand alone query. Typically, it occurs in IN or NOT IN SQL clause.
  • correlated : correlated subqueries are slower queries and one should avoid it as much as possible. Use JOIN instead if you can.
library(tidyverse)
db_stock <- tribble(
  ~ric, ~company, ~listed_on_exchange,
  "6785.T", "Sony", "T",
  "GOOG.O", "Google Inc", "O",
  "GS.N", "Goldman Sachs Group Inc", "N",
  "INDIGO", "INDIGO Airlines", NA,
  "INFY.BO", "InfoSys", "BO",
  "VOD.L", "Vodafone Group PLC", "L"
)
db_market <- tribble(
  ~ric, ~name, ~country,
  "T", "Tokyo Stock Exchange", "Japan",
  "O", "NASDAQ", "United States",
  "N", "New York Stock Exchange", "United States",
  "BO", "Bombay Stock Exchange", "India"
)
#Find all stocks from Japan
sqldf("SELECT company from db_stock WHERE listed_on_exchange = (SELECT ric FROM db_market WHERE country='Japan')")
## Loading required package: tcltk
##   company
## 1    Sony
#Find all stocks from US and India
sqldf("SELECT company from db_stock WHERE listed_on_exchange IN (SELECT ric FROM db_market WHERE country='United States' OR country='India')")
##                   company
## 1              Google Inc
## 2 Goldman Sachs Group Inc
## 3                 InfoSys
# Return all markets which has at least one stock listed on it
db_market
## # A tibble: 4 x 3
##     ric                    name       country
##   <chr>                   <chr>         <chr>
## 1     T    Tokyo Stock Exchange         Japan
## 2     O                  NASDAQ United States
## 3     N New York Stock Exchange United States
## 4    BO   Bombay Stock Exchange         India
db_stock
## # A tibble: 6 x 3
##       ric                 company listed_on_exchange
##     <chr>                   <chr>              <chr>
## 1  6785.T                    Sony                  T
## 2  GOOG.O              Google Inc                  O
## 3    GS.N Goldman Sachs Group Inc                  N
## 4  INDIGO         INDIGO Airlines               <NA>
## 5 INFY.BO                 InfoSys                 BO
## 6   VOD.L      Vodafone Group PLC                  L
sqldf("SELECT m.name FROM db_market m WHERE m.ric = (SELECT s.listed_on_exchange FROM db_stock s WHERE s.listed_on_exchange = m.ric)")
##                      name
## 1    Tokyo Stock Exchange
## 2                  NASDAQ
## 3 New York Stock Exchange
## 4   Bombay Stock Exchange

Interview Questions

# Question:  Identify second highest salary 
# Answer - version 1 (non-correlated )
sqldf("SELECT max(salary) FROM db WHERE Salary NOT IN (SELECT max(salary) FROM db)")
##   max(salary)
## 1         200
# Answer - version 2 (correlated subquery)
sqldf("SELECT id, salary FROM db a WHERE 2=(SELECT COUNT(DISTINCT salary) FROM db b WHERE a.salary <=b.salary)")
##   id salary
## 1  2    200