What is PySpark?

Spark is a distributed Big Data processing framework. If you use Python, you can use PySpark, the Python API for Spark, to process Big Data.

How to install PySpark?

Since PySpark is a distributed data processing framework, it is usually run on a cluster. Various cloud providers, such as AWS, Databricks, and Cloudera, let you create a cluster. You can also install PySpark in standalone mode on your personal computer. In your virtual environment, simply run the following command to install PySpark on your machine.

pip install pyspark 

Simple PySpark Program

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point
# for DataFrame and SQL operations
spark = (SparkSession.builder
                     .appName("sample_program")
                     .getOrCreate())

Resources