PySpark Profiler

PySpark supports custom profilers, which allow different profiling behavior and output formats than the built-in BasicProfiler provides. A profiler collects statistics about the functions executed for each RDD, such as call counts and time spent, which makes it a useful tool for understanding where a PySpark program spends its time and for finding bottlenecks.

A custom profiler has to define or inherit the following methods:

  • add

The add method adds a profile to the existing accumulated profile. The profiler class is chosen when the SparkContext is created, by passing it as the profiler_cls argument, as the example below shows.
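For example, a custom profiler can subclass BasicProfiler and override show. The sketch below is adapted from the example in the official PySpark documentation; the exact list printed by take(10) depends on the transformation being profiled.

from pyspark import SparkConf, SparkContext, BasicProfiler

class MyCustomProfiler(BasicProfiler):
    def show(self, id):
        print("My custom profiles for RDD:%s" % id)

# Enable Python profiling and register the custom profiler class.
conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext('local', 'test', conf=conf, profiler_cls=MyCustomProfiler)

print(sc.parallelize(range(1000)).map(lambda x: 2 * x).take(10))
sc.parallelize(range(1000)).count()

sc.show_profiles()  # invokes MyCustomProfiler.show for each profiled RDD
sc.stop()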

Output:

[0, 4, 7, 9, 8, 15, 20, 18, 21, 25]
My custom profiles for RDD:1
My custom profiles for RDD:3
  • profile

It produces a system profile of some sort; the BasicProfiler, for example, produces cProfile data.

  • stats

This method returns the collected profiling stats.

  • dump

It dumps the accumulated profiles to a given path. A minimal skeleton that defines these methods is sketched below.
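The following is a minimal sketch of such a custom profiler. SimpleProfiler is a hypothetical name, and merging the stats locally with pstats is a simplification: the real BasicProfiler merges them through an Accumulator so that profiles from the workers reach the driver.

import cProfile
import pstats

from pyspark.profiler import Profiler

class SimpleProfiler(Profiler):
    # Hypothetical sketch: profile() and stats() are the methods a
    # subclass must supply; show() and dump() are inherited from the
    # Profiler base class.
    def __init__(self, ctx):
        super().__init__(ctx)
        self._stats = None

    def profile(self, func):
        pr = cProfile.Profile()
        pr.runcall(func)        # run the task function under cProfile
        st = pstats.Stats(pr)
        st.strip_dirs()
        if self._stats is None:
            self._stats = st
        else:
            self._stats.add(st) # add this profile to the accumulated one

    def stats(self):
        return self._stats      # return the collected stats

Such a class would be registered exactly like MyCustomProfiler above, via the profiler_cls argument of the SparkContext.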

The profiler classes expose these methods:

  • dump(id, path)

This method dumps the profile for a single RDD to the given path; here id represents the RDD id.

  • profile(func)

It performs profiling on the function passed in as func.

  • show(id)

This method prints the profile stats to stdout; here id is the RDD id.

  • stats()

The stats() method returns the collected profiling stats.
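In practice these methods are usually invoked indirectly: SparkContext.show_profiles() calls show(id) for every profiled RDD, and SparkContext.dump_profiles(path) calls dump(id, path). A short sketch, assuming sc was created with spark.python.profile enabled as above (/tmp/pyspark_profiles is an arbitrary example path):

sc.parallelize(range(100)).map(lambda x: x + 1).count()

sc.show_profiles()                         # prints stats() for each RDD id to stdout
sc.dump_profiles("/tmp/pyspark_profiles")  # writes one .pstats file per RDD id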

class pyspark.BasicProfiler(ctx)

It is the default profiler, implemented on top of cProfile and Accumulator.
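Because BasicProfiler is the default, no profiler_cls argument is needed; enabling spark.python.profile is enough. A minimal sketch:

from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext("local", "profiler-demo", conf=conf)  # BasicProfiler is used by default

sc.parallelize(range(1000)).map(lambda x: x * x).count()
sc.show_profiles()  # prints the cProfile output collected by BasicProfiler
sc.stop()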

