What is PySpark?

Spark is a distributed Big Data processing framework. If you use Python, you can use PySpark, the Python API for Spark, to process Big Data.

How to install PySpark?

Since PySpark is a distributed data processing framework, it is usually run on a cluster. Various cloud providers, such as AWS, Databricks, and Cloudera, let you create a cluster. You can also install PySpark in standalone mode on your personal computer. In your virtual environment, simply run the following command to install PySpark on your machine.

pip install pyspark 

Simple PySpark Program

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point
# for DataFrame and SQL operations
spark = (SparkSession.builder
                     .appName("sample_program")
                     .getOrCreate())

Resources