Query Performance: SQL Server vs. Spark

  1. Install the AdventureWorks2014 database on a local SQL Server instance
  2. Create a slow-running query to compare query performance
  3. Decide how to measure query execution time
  4. Run the query in SSMS and export the result as a .csv file
-- Report CPU and elapsed time for each statement
SET STATISTICS TIME ON

-- Deliberately slow: joining on the non-selective Quantity column
-- multiplies rows enormously, capped here at 10,000,000
SELECT TOP 10000000
       th.*
FROM   Production.TransactionHistory th
JOIN   Production.TransactionHistoryArchive tha ON th.Quantity = tha.Quantity

SET STATISTICS TIME OFF
(10000000 row(s) affected)

 SQL Server Execution Times:
   *CPU time = 3813 ms,  elapsed time = 99440 ms.*
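
For steps 3 and 4, SSMS can save the grid output via "Save Results As…", but the export can also be scripted. Below is a minimal Python 3 sketch, assuming the pyodbc package, the "ODBC Driver 17 for SQL Server" ODBC driver, and Windows integrated authentication; the connection string and output file name are illustrative, not part of the original setup.

import csv
import time

import pyodbc

# Assumed connection string: local default instance, integrated security
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=AdventureWorks2014;Trusted_Connection=yes"
)
cursor = conn.cursor()

start = time.time()
cursor.execute("""
SELECT TOP 10000000 th.*
FROM   Production.TransactionHistory th
JOIN   Production.TransactionHistoryArchive tha ON th.Quantity = tha.Quantity
""")

# Stream rows to disk in batches instead of holding 10M rows in memory
with open("ssms-query-result.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cursor.description])  # header row
    while True:
        rows = cursor.fetchmany(10000)
        if not rows:
            break
        writer.writerows(rows)

print(str(time.time() - start) + " seconds")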
  5. Run the same query in PySpark (standalone) by submitting a job and export the result as a .csv file
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
import time

conf = SparkConf().setMaster("local").setAppName("TestingSqlQuery")
sc = SparkContext(conf=conf)

# The slow query, wrapped as a subquery so it can be passed as "dbtable"
query = """(
SELECT TOP 10000000
       th.*
FROM   Production.TransactionHistory th
JOIN   Production.TransactionHistoryArchive tha ON th.Quantity = tha.Quantity
) as Alias"""

sqlContext = SQLContext(sc)

start = time.time()
df = sqlContext.read.format("jdbc").options(
  url="jdbc:sqlserver://localhost;databasename=AdventureWorks2014;integratedSecurity=true;",
  driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
  dbtable=query,
  # Without partitionColumn/lowerBound/upperBound, the read still goes
  # through a single JDBC connection; see the partitioned sketch below
  numPartitions=4).load()

# Note: Spark writes a directory of part files, not a single .csv
df.write.csv('testing-sql-query-result.csv')

end = time.time()
print(str(end - start) + ' seconds')
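
As noted in the comment above, df.write.csv produces a directory of part files rather than a single .csv. If a single file is needed to match the SSMS export, the DataFrame can be coalesced to one partition first; a minimal sketch (the header option and single-file requirement are assumptions, not part of the original run):

# Coalescing to one partition yields a single part file, at the cost of
# funneling all 10M rows through a single write task
df.coalesce(1).write.csv('testing-sql-query-result.csv', header=True)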
PS C:\users\yohan\documents\spark> spark-submit testing-sql-query.py
17/05/22 22:06:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
*59.2020001411 seconds*
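
One caveat on the comparison: with only numPartitions set and no partition bounds, Spark's JDBC source reads the whole result through a single connection, so the 59 seconds above is essentially a single-threaded pull plus the CSV write, while SQL Server's elapsed time covers returning rows to SSMS. To actually read in parallel, the JDBC options need partitionColumn, lowerBound, and upperBound as well. A sketch, reusing query and sqlContext from the script above; the TransactionID bounds are illustrative assumptions, and in practice should come from a SELECT MIN/MAX on the column:

# Partitioned read: Spark issues one range query per partition,
# splitting TransactionID into 4 ranges read concurrently.
df = sqlContext.read.format("jdbc").options(
  url="jdbc:sqlserver://localhost;databasename=AdventureWorks2014;integratedSecurity=true;",
  driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
  dbtable=query,
  partitionColumn="TransactionID",  # must appear in the subquery's output
  lowerBound=100000,                # illustrative; use SELECT MIN(TransactionID)
  upperBound=214000,                # illustrative; use SELECT MAX(TransactionID)
  numPartitions=4).load()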