What is PySpark?
Spark is a distributed Big Data processing framework. If you use Python,
you can use PySpark,
the Python API for Spark, to process Big Data.
How to install?
Since PySpark is a distributed data processing framework, people typically use it by creating a cluster. Various cloud providers such as AWS, Databricks, and Cloudera allow you to create a cluster. You can also install PySpark in standalone mode on your personal computer. In your virtual environment, simply run the following command to install PySpark on your machine.
pip install pyspark
Simple PySpark Program
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point for DataFrame operations
spark = (SparkSession.builder
         .appName("sample_program")
         .getOrCreate())