Home » Pickle Module of Python

Pickle Module of Python

by Online Tutorials Library

Pickle Module of Python

A developer may sometimes want to send some complex object commands through the network and save the internal state of their objects to the disk or database for using it later. To achieve this, the developer can use the serialization process, which is supported by the standard library, cause of Python’s Pickle Module.

In this tutorial, we will discuss serializing and deserializing the object and which module the user should use for serializing the objects in python. The kind of objects can be serialized by using the Pickle module in python. We will also explain how to use the Pickle module for serializing the object hierarchies and what are the risks a developer can face while deserializing the object from the untrusted source?

Serialization in Python

The process of serializing is to convert the data structure into a linear form, which can be stored or transmitted through the network.

In Python, the serialization allows the developer to convert the complex object structure into a stream of bytes that can be saved in the disk or can send through the network. The developer can refer to this process as marshalling. Whereas, Deserialization is the reverse process of serialization in which the user takes the stream of bytes and transforms it into the data structure. This process can be referred to as unmarshalling.

The developer can use serialization in many different situations. And one of them is saving the internal state of the neural networking after processing the training phase so that they can use the state later and they don’t have to do the training again.

In Python there are three modules in the standard library that allows the developer to serialize and deserialize the objects:

  1. The pickle module
  2. The marshal module
  3. The json module

Python also supports XML, which developers can use for serializing the objects.

The json Module is the latest module out of the three. This allows the developer to work beside standard JSON files. Json is the most suitable and commonly used format for data exchange.

There are numerous reasons for choosing the JSON format:

  • It is Human Readable
  • It is Language Independent
  • It is Lighter than XML

With json Module, the developer can do serializing and deserializing different standard Python types:

  • List
  • dict
  • string
  • int
  • tuple
  • bool
  • float
  • None

The oldest module in these three modules is the marshal module. Its primary purpose is reading and writing the compiled bytecode of Python modules, or the .pyc files which the developer gets when the Python module is imported by the interpreter. Therefore, the developer can use the marshal module for serializing the object it is not recommended.

The pickle module of Python is another method of serializing and deserializing the objects in Python. This is different from json module as in this. The object is serialized in the binary format, whose result is not readable by humans. Although, it is faster than the others, and it can work with many other python types, including the developer’s custom -defined objects.

So, the developer can use many different methods for serializing and deserializing the objects in Python. The three important guidelines for concluding which method is suitable for the developer’s case are:

  1. Do not use the marshal module, as it is used mostly by the interpreter. And the official documentation warns that the maintainers of the Python can modify the format in backward -incompatible types.
  2. The XML and json modules are safe choices if the developer wants interoperability with different languages and human -readable format.
  3. The Python pickle module is the best choice for all the remaining cases. Suppose the developer does not want a human -readable format and a standard interoperable format. And they require to serialize the custom objects. Then they can choose the pickle module.

Inside The pickle Module

The pickle module of python contains the four methods:

  1. dump( obj, file, protocol = None, * , fix_imports = True, buffer_callback = None)
  2. dumps( obj, protocol = None, * , fix_imports = True, buffer_callback = None)
  3. load( file, * , fix_imports = True, encoding = ” ASCII “, errors = “strict “, buffers = None)
  4. loads( bytes_object, * , fix_imports = True, encoding = ” ASCII “, errors = ” strict “, buffers = None)

The first two methods are used for the pickling process, and the next two methods are used for the unpickling process.

The difference between dump() and dumps() is that dump() creates the file which contains the serialization results, and the dumps() returns the string.

For differentiation dumps() from the dump(), the developer can remember that in the dumps() function, ‘ s’ stands for the string.

The same concept can be applied to the load() and loads() function. The load() function is used for reading the file for the unpickling process, and the loads() function operates on the string.

Suppose the user has a custom -defined class named forexample_class with many different attributes, and each one of them is of different types:

  • the_number
  • the_string
  • the_list
  • the_dictionary
  • the_tuple

The example below explains how the user can instantiate the class and pickle the instance to get the plain string. After pickling the class, the user can modify the value of its attributes without affecting the pickled string. User can afterward unpickle the string which was pickled earlier in another variable, and restore the copy of the pickled class.

For example:

Output:

This is user's pickled object:    b' x80 x04 x95$ x00 x00 x00 x00 x00 x00 x00 x8c x08__main__ x94 x8c x10forexample_class x94 x93 x94) x81 x94. '       This is the_dict of the unpickled object:    {' first ': ' a ', ' second ': 2, ' third ': [ 1, 2, 3 ] }   

Example –

Explanation

Here, the process of pickling has ended correctly, and it stores the user’s whole instance in the string: b’ x80 x04 x95$ x00 x00 x00 x00 x00 x00 x00 x8c x08__main__ x94 x8c x10forexample_class x94 x93 x94) x81 x94. ‘After completing the process of pickling, the user can change their original objects making the_dict attribute equals to None.

Now, the user can process for unpickling the string into the utterly new instance. When the user gets a deep copy of their original object structure from the time when the process of pickling the object began.

Protocol Formats of the Pickle Module in Python

The pickle module is python -specific, and its results can only be readable to another python program. Although the developer might be working with python, they should know that the pickle module is advanced now.

This means that if the developer has pickled the object with some specific version of python, they might not be able to unpickle the object with the previous version.

The compatibility of this depends on the protocol version that the developer used for the while process of pickling.

There are six different protocols that the Pickle module of python can use. The requirement of unpickling the most recent python interpreter is directly proportional to the highness of the protocol version.

  1. Protocol version 0 – It was the first version. It is human readable no like the other protocols
  2. Protocol version 1 – It was the first binary format.
  3. Protocol version 2 – It was introduced in Python 2.3.
  4. Protocol version 3 – It was added in Python 3.0. The Python 2.x version cannot unpickle it.
  5. Protocol version 4 – It was added in Python 3.4. It features support for a wider range of object sizes and types and is the default protocol starting with Python 3.8.
  6. Protocol version 5 – It was added in Python 3.8. It features support for out-of-band dataand improved speeds for in-band data.

To choose a specific protocol, the developer has to specify the protocol version when they call dump(), dumps(), load() or loads() functions. If they do not specify the protocol, their interpreter will use the default version specified in the pickle.DEFAULT_PROTOCOL attribute.

Types of Pickleable and Unpickleable

We have already discussed that the pickle module of python can serialize much more types than the json module although everything is not pickleable.

The list of unpickleable objects also contains the database connections, running threads, opened network sockets, and many others.

If the user got stuck with the unpickleable objects, then there are few things they can do. The first option they have is to use the third -part library, for example, dill.

The dill library can extend the capabilities of the pickle. This library can let the user serialize fewer common types such as functions with yields, lambdas, nested functions, and many more.

For testing this module, the user can try to pickle the lambda function.

For example:

If the user tries to run this code, they will get an exception because the pickle module of python can not serialize the lambda function.

Output:

PicklingError                             Traceback (most recent call last)  <ipython-input-9-1141f36c69b9> in <module>        3         4 squaring = lambda x : x * x  ----> 5 user_pickle = pickle.dumps(squaring)    PicklingError: Can't pickle <function <lambda> at 0x000001F1581DEE50>: attribute lookup <lambda> on __main__ failed  

Now, if the user replaces the pickle module with the dill library, they can see the difference.

For example:

After running the above program, the user can see that the dill library has serialized the lambda function without any error.

Output:

b' x80 x04 x95 xb2 x00 x00 x00 x00 x00 x00 x00 x8c ndill._dill x94 x8c x10_create_function x94 x93 x94 ( h x00 x8c x0c_create_code x94 x93 x94 ( K x01K x00K x00K x01K x02KCC x08| x00| x00 x14 x00S x00 x94N x85 x94 ) x8c x01x x94 x85 x94 x8c x1f< ipython-input-11-30f1c8d0e50d > x94 x8c x08< lambda > x94K x04C x00 x94 ) )t x94R x94c__builtin__ n__main__ nh nNN } x94Nt x94R x94. '  

There is another interesting feature of dill library, such as it can serialize the whole interpreter session.

For example:

In the above example, the user has started the interpreter, imported the module, and then defined the lambda function along with a few of the other variables. They have then imported the dill library and called the dump_session() function for serializing the whole session.

If the user has run the code correctly, then they would be getting the testing.pkl file in their current directory.

Output:

$ ls testing.pkl  4 [email protected] 1 dave  staff  493 Feb  12 09:52 testing.pkl  

Now, the user can start the new instance of the interpreter and load the testing.pkl file for resorting to their last session.

For example:

Output:

dict_items( [ ( ' __name__ ' , ' __main__ ' ) , ( ' __doc__ ' , ' Automatically created module for IPython interactive environment ' ) , ( ' __package__ ' , None ) , ( ' __loader__ ' , None ) , ( ' __spec__ ' , None ) , ( ' __builtin__ ' , < module ' builtins ' ( built-in ) > ) , ( ' __builtins__ ' , < module ' builtins ' ( built-in ) > ) , ( ' _ih ' , [ ' ' , ' globals().items() ' ] ) , ( ' _oh ' , {} ) , ( ' _dh ' , [ ' C:\Users \User Name \AppData \Local \Programs \Python \Python39 \Scripts ' ] ) , ( ' In ' , [ ' ' , ' globals().items() ' ] ) , ( ' Out ' , {} ) , ( ' get_ipython ' , < bound method InteractiveShell.get_ipython of < ipykernel.zmqshell.ZMQInteractiveShell object at 0x000001E1CDD8DDC0 > > ) , ( ' exit ' , < IPython.core.autocall.ZMQExitAutocall object at 0x000001E1CDD9FC70 > ) , ( ' quit ' , < IPython.core.autocall.ZMQExitAutocall object at 0x000001E1CDD9FC70 > ) , ( ' _ ' , ' ' ) , ( ' __ ' , ' ' ) , ( ' ___ ' , ' ' ) , ( ' _i ' , ' ' ) , ( ' _ii ' , ' ' ) , ( ' _iii ' , ' ' ) , ( ' _i1 ' , ' globals().items() ' ) ] )  

Output:

dict_items( [ ( ' __name__ ' , ' __main__ ' ) , ( ' __doc__ ' , ' Automatically created module for IPython interactive environment ' ) , ( ' __package__ ' , None ) , ( ' __loader__ ' , None ) , ( ' __spec__ ' , None ) , ( ' __builtin__ ' , < module ' builtins ' ( built-in ) > ) , ( ' __builtins__ ' , < module ' builtins ' ( built-in ) > ) , ( ' _ih ' , [ ' ' , " squaring = lambda x : x * x na = squaring( 25 ) nimport math nq = math.sqrt ( 139 ) nimport dill ndill.dump_session( ' testing.pkl ' ) nexit() " ] ) , ( ' _oh ' , {} ) , ( ' _dh ' , [ ' C:\ Users\ User Name \AppData \Local \Programs \Python \Python39 \Scripts ' ] ) , ( ' In ' , [ ' ' , " squaring = lambda x : x * x np = squaring( 25 ) nimport mathnq = math.sqrt ( 139 ) nimport dill ndill.dump_session( ' testing.pkl ' ) nexit() " ] ) , ( ' Out ' , {} ) , ( ' get_ipython ' , < bound method InteractiveShell.get_ipython of < ipykernel.zmqshell.ZMQInteractiveShell object at 0x000001E1CDD8DDC0 > > ) , ( ' exit ' , < IPython.core.autocall.ZMQExitAutocall object at 0x000001E1CDD9FC70 > ) , ( ' quit ' , < IPython.core.autocall.ZMQExitAutocall object at 0x000001E1CDD9FC70 > ) , ( ' _ ' , ' ' ) , ( ' __ ' , ' ' ) , ( ' ___ ' , ' ' ) , ( ' _i ' , ' ' ) , ( ' _ii ' , ' ' ) , ( ' _iii ' , ' ' ) , ( ' _i1 ' , " squaring = lambda x : x * x np = squaring( 25 ) nimport math nq = math.sqrt ( 139 ) nimport dill ndill.dump_session( ' testing.pkl ' ) nexit() " ) , ( ' _1 ' , dict_items( [ ( ' __name__ ' , ' __main__ ' ) , ( ' __doc__ ' , ' Automatically created module for IPython interactive environment ' ) , ( ' __package__ ' , None ) , ( ' __loader__ ' , None ) , ( ' __spec__ ' , None ) , ( ' __builtin__ ' , < module ' builtins ' ( built-in ) > ) , ( ' __builtins__ ' , < module ' builtins ' ( built-in ) > )   

Output:

625  

Output:

22.0  

Output:

(x)>  

Here, the first globals().item() statement reveals that the interpreter is in the initial state, meaning that the developer has to import the dill library and invoke load_session() for restoring their serialized interpreter session.

Developers should remember that if they are using the dill library instead of the pickle module, that standard library does not include the dill library. It is slower than the pickle module.

Dill library can serialize a wider range of objects than the pickle module, but it cannot solve every problem of serialization that the developer can face. If the developer wants to serialize the object which contains a database connection, then they cannot work with the dill library. That is an unserialized object for the dill library.

The solution to this problem is to exclude the object during the process of serialization for reinitializing the connection after the object is deserialized.

The developer can use the _getstate_() for defining which objects should be included in the pickling process and whatnot. This method allows the developer to specify what they want to pickle. If they do not override _getstate_(), then the _dict_() will be used, which is a default instance.

In the following example, the user has defined the class with several attributes and then excluded one of the attributes for the process of serialization by using _getstate_().

For example:

In the above example, the user has created the object with three attributes, and one of the attributes is a lambda, which is an unpickleable object for the pickle module. For solving this issue, they have specified in the _getstate_() which attribute to pickle. The user has cloned the whole _dict_ of the instance for defining all the attributes in the class, and then they have removed the unpickleable attribute ‘r’.

After running this code and then deserializing the object, the user can see that the new instance does not contain the ‘r’ attribute.

Output:

{'p': 25, 'q': ' testing '}  

But if the user wants to do additional initialization during the process of unpickling, such as adding the excluded ‘r’ attribute back to the deserialized instance. They can do this by using the _setstate_() function.

For example:

Here, bypassing the excluded attribute ‘r’ to the _setstate_(), the user has ensured that the object will appear in the _dict_ of the unpickling string.

Output:

{' p ': 25, ' q ': ' testing ', ' r ': < function foobar.__setstate__.< locals >.< lambda > at 0x000001F2CB063700 > }  

Compression of the Pickle Objects

The pickle data format is the compact binary representation of the object structure, but still, the users can optimize their pickle string by compressing it with bzip2 or gzip.

For compressing the pickled string with bzip2, the user has to use the bz2 module, which is provided in the standard library of python.

For example, the user has taken the string and will pickle it and then compress it by using the bz2 module.

For example:

Output:

312  len( compressed )  

Output:

262  

The user should remember that the smaller files come at the cost of the slower process.

Security Concerns with the Pickle Module

Till now, we have discussed how to use the pickle module for serializing and deserializing the objects in Python. The process of serialization is convenient when the developer wants to save the state of their objects to disk or to transmit it through the network.
Although, there is one more thing that the developer should know about the pickle module of python, that it is not very secure. As earlier, we have discussed the use of the _setstate_() function. This method is best for performing more initialization along with the unpickling process. Still, it is also used for executing arbitrary code during the unpickling process.
There is nothing much a developer can do to reduce the risk. The basic rule is the developer should never unpickle the data which comes from the untrusted source or transmitted through the insecure network. For preventing the attacks, the user can use libraries like hmac for signing the data and making sure that it has not been tampered with.

For example:

To see how unpickling the tampered pickle can expose the system of the user to the attackers.

Example –

In the above example, the process of unpickling has executed _setstate_(), which will execute a Bash command for opening the remote shell to the 192.168.1.10 system on port 8080.

This is how the user can safely test the scrip on their Mac or Linux box. First, they have to open the terminal and then use the nc command for listing the connection to port 8080.

For example:

This terminal will be for attackers.

Then, the user has to open another terminal on the same computer system and execute the python code for unpicking the malicious code.

The user has to make sure that they have to change the IP address in the code for the IP address of their attacking terminal. After executing the code, the shell is exposed to the attackers.

Now, a bash shell will be visible on the attacking console. This console can be operated directly now, on the system which is attacked.

For example:

Output:

bash: no job control in this shell    The default interactive shell is now zsh.  To update your account to use zsh, please run ` chsh -s /bin /zsh`.  For more details, please visit https://support.apple.com /kb /HT208060.  bash-3.1$  

Conclusion:

This article discussed how to serialize and deserialize the objects by using different modules of python and why the pickle module is better than others. We have also explained how some of the objects cannot be unpickled and how we can avoid the problems created by them.


You may also like