Query Performance: SQL vs Spark
21 May 2017
- Install the AdventureWorks2014 database on a local SQL Server instance
- Create a slow-running script to compare query performance
- Work out how to measure the query execution time (a small timing sketch follows this list)
- Run the script in SSMS & export the result as a .csv file
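
On the Spark side, elapsed time is measured with plain `time.time()` calls taken before and after the work, as the script further down does inline. A minimal sketch of making that wall-clock measurement reusable (the helper itself is my shorthand, not part of the scripts below); the SSMS side instead relies on SET STATISTICS TIME, which reports server-side figures:

```python
# Minimal wall-clock timing helper: time.time() before and after the block.
# Only a sketch; the actual Spark script below inlines the same two calls.
from contextlib import contextmanager
import time

@contextmanager
def timed(label):
    start = time.time()
    yield
    print(label + ': ' + str(time.time() - start) + ' seconds')

# Usage:
# with timed('spark job'):
#     df.write.csv('out.csv')  # any operation to be measured
```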
The join on Quantity, a non-key, low-selectivity column, fans out to far more rows than either table holds, and TOP 10000000 caps the result at ten million rows:

```sql
SET STATISTICS TIME ON

SELECT TOP 10000000
    th.*
FROM Production.TransactionHistory th
JOIN Production.TransactionHistoryArchive tha ON th.Quantity = tha.Quantity

SET STATISTICS TIME OFF
```
```
(10000000 row(s) affected)

SQL Server Execution Times:
   CPU time = 3813 ms, elapsed time = 99440 ms.
```
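
SET STATISTICS TIME only reports server-side time. For a client-side wall-clock number closer to what the Spark job below measures (connect, run, fetch, export), the same query can be timed from Python. This is a sketch, not part of the original setup: it assumes pyodbc and a Microsoft ODBC driver (here "ODBC Driver 17 for SQL Server") are installed.

```python
# Sketch: time the same query end-to-end from a Python client.
# Assumes pyodbc and "ODBC Driver 17 for SQL Server" are installed.
import csv
import time

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=AdventureWorks2014;Trusted_Connection=yes")

sql = """
SELECT TOP 10000000 th.*
FROM Production.TransactionHistory th
JOIN Production.TransactionHistoryArchive tha ON th.Quantity = tha.Quantity
"""

start = time.time()
cursor = conn.cursor()
cursor.execute(sql)

# Stream rows straight to a .csv so the timing covers fetch + export,
# mirroring what the Spark job below measures.
with open('testing-sql-query-result.csv', 'w') as f:
    writer = csv.writer(f)
    for row in cursor:
        writer.writerow(list(row))

print(str(time.time() - start) + ' seconds')
conn.close()
```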
- Run the same query from PySpark (standalone) by submitting a job & export the result as a .csv file
```python
from pyspark import SparkConf, SparkContext, SQLContext
import time

conf = SparkConf().setMaster("local").setAppName("TestingSqlQuery")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# The same join, wrapped as an aliased subquery so it can be passed
# to the JDBC reader as its dbtable option
query = """(
SELECT TOP 10000000
    th.*
FROM Production.TransactionHistory th
JOIN Production.TransactionHistoryArchive tha ON th.Quantity = tha.Quantity
) AS Alias"""

start = time.time()

# NB: numPartitions alone has no effect on a read; Spark only splits a
# JDBC read when partitionColumn, lowerBound and upperBound are also set
df = sqlContext.read.format("jdbc").options(
    url="jdbc:sqlserver://localhost;databaseName=AdventureWorks2014;integratedSecurity=true;",
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
    dbtable=query,
    numPartitions=4).load()

# write.csv creates a directory of part files, not a single .csv
df.write.csv('testing-sql-query-result.csv')

end = time.time()
print(str(end - start) + ' seconds')

sc.stop()
```
```
PS C:\users\yohan\documents\spark> spark-submit testing-sql-query.py
17/05/22 22:06:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
59.2020001411 seconds
```
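
One caveat on the read above: with only numPartitions set, Spark issues a single JDBC query, so the load itself is not parallelized. To actually split the read, partitionColumn, lowerBound and upperBound must be supplied together with numPartitions. A minimal sketch, using TransactionID as the partition column; the bounds are assumptions about the data rather than values from this run, so check MIN/MAX first:

```python
from pyspark import SparkConf, SparkContext, SQLContext

conf = SparkConf().setMaster("local[4]").setAppName("PartitionedJdbcRead")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Spark splits the [lowerBound, upperBound) range of TransactionID into
# numPartitions stripes and runs one JDBC query per stripe
df = sqlContext.read.format("jdbc").options(
    url="jdbc:sqlserver://localhost;databaseName=AdventureWorks2014;integratedSecurity=true;",
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
    dbtable="Production.TransactionHistory",
    partitionColumn="TransactionID",  # numeric column to stripe on
    lowerBound="100000",              # assumed ID range, not from the post
    upperBound="200000",
    numPartitions=4).load()

print(df.rdd.getNumPartitions())  # should report 4

sc.stop()
```

Note that lowerBound and upperBound only control where the stripes are cut; rows outside the range are not filtered out, they just all land in the first and last partitions.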