Connecting Azure SQL through PySpark

31 Mar 2017

Download sqljdbc
Extract files in a directory c:/spark
Copy & paste c:/spark/sqljdbc_6.0/enu/auth/sqljdb_auth.dll to c:/windows/system32
Create test-sql.py as below

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf = conf)

query = "(SELECT Id from Matter) as mt"

sqlContext = SQLContext(sc)

df = sqlContext.read.format("jdbc").options(
  url="jdbc:sqlserver://domain.database.windows.net;" +
         "databaseName=collaborate-7;user=[USER_ID]];password=[Password]];Integrated Security=False",
  driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",  
  dbtable=query).load()

df.show()

Run cmd with driver as spark-submit –driver-class-path c:/spark/sqljdbc_6.0/enu/jre8/sqljdbc42.jar test-sql.py
Output

<class 'pyspark.sql.dataframe.DataFrame'>
+---+
| Id|
+---+
| 12|
| 13|
|  1|
|  7|
|  2|
|  8|
|  9|
| 10|
| 11|
|  3|
+---+
only showing top 10 rows

See more spark-pyspark-to-extract-from-sql-server

Yohan

Connecting Azure SQL through PySpark

Related Posts

Actor Model with Message Queue 04 Jun 2017

Running Spark Cluster Mode on Windows Prompt 24 May 2017

Query Performance SQL vs Spark 21 May 2017