Hi all! In this post I'll talk about PR #437.
There are several reasons to have a streaming system for data visualization. Since I'm doing a PhD in a developing country, I always need to think about the cheapest way to use the available computational resources. For example, with GPU prices increasing, it’s necessary to share a machine with a GPU among different users in different locations. Therefore, to convince my Brazilian friends to use FURY, I need to code with the low-budget scenario in mind.
To construct the streaming system for my project, I thought about the following properties and behaviors:
- I want to avoid blocking the code execution in the main thread (where the vtk/fury instance resides)
- The streaming should work in a low-bandwidth environment
- I need an easy way to share the rendering result, for example using the free version of ngrok
To achieve the first property, we need to circumvent the GIL problem. Using the threading module alone is not good enough, because Python threads cannot run CPU-bound work in parallel. In addition, for better organization it’s preferable to define the server system as a decoupled module. Therefore, I believe that Python's multiprocessing library fits our purposes very well.
For the streaming system to work smoothly in a low-bandwidth scenario, we need to choose the protocol wisely. In recent years the WebRTC protocol has been used in a myriad of applications, like Google Hangouts and Google Stadia, that aim for low latency. Therefore, I chose WebRTC as the first protocol to be available in the streaming system proposal.
To achieve the third property, we must be economical in adding requirements and dependencies.
Currently, the system has some issues, but it's already working. You can see some tutorials about how to use this streaming system here. After running one of these examples you can easily share the results and the interaction with other users. For example, using ngrok:
./ngrok http 8000
How does it work?
The image below is a simple representation of the streaming system.
As you can see, the streaming system is made up of different processes that share some memory blocks with each other. One of the hardest parts of this PR was coding this sharing between different objects like VTK, numpy and the web server. Next, I'll discuss some of the technical issues that I had to learn about and circumvent.
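The idea of separate processes sharing a memory block can be sketched as follows. This is only a minimal illustration, not the code from the PR: `serve_frame`, `run_demo`, the buffer size and the checksum are hypothetical names and values chosen for the example. The main process plays the role of the renderer and a child process plays the role of the server that reads the same memory.

```python
from multiprocessing import Process, RawArray


def serve_frame(frame_buffer, result):
    """Stand-in for the server process: it only reads the shared frame."""
    result[0] = sum(frame_buffer)


def run_demo(size=8, value=5):
    """Write a frame in the main process and read it from a child process."""
    # Both blocks live in shared memory, so the child sees them without copies.
    frame_buffer = RawArray("B", size)  # "B" = unsigned byte
    result = RawArray("L", 1)           # "L" = unsigned long

    # The main process plays the role of the renderer and fills the frame.
    for i in range(size):
        frame_buffer[i] = value

    # The child process reads the same memory and reports a checksum.
    reader = Process(target=serve_frame, args=(frame_buffer, result))
    reader.start()
    reader.join()
    return result[0]
```

Calling `run_demo()` from under an `if __name__ == "__main__":` guard should return 40 with the default arguments. The guard matters on platforms that start child processes with the spawn method.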
Sharing data between processes
We want to avoid any kind of unnecessary duplication of data or expensive copy/write actions. We can achieve this economy of computational resources using the multiprocessing module from Python.
RawArray from multiprocessing allows sharing resources between different processes. However, there are
some tricks to get better performance when we are dealing with RawArrays.
For example, take a look at an older stage of my PR.
In this older stage my streaming system was working well. However, one of my mentors (Filipi Nascimento) saw a huge latency in the high-resolution examples. My first thought was that the latency was caused by the GPU-CPU copy from the OpenGL context. However, I discovered that I had been using RawArrays wrong my entire life!
See, for example, this line of code: fury/stream/client.py#L101. The code below shows how I had been updating the raw arrays:
raw_arr_buffer[:] = new_data
This works fine for small and medium-sized arrays, but for large arrays it takes a long time, more than the GPU-CPU copy. The explanation for this bad performance is available here: Demystifying sharedctypes performance. The solution, which gives a stupendous performance improvement, is quite simple. RawArrays implement the buffer protocol. Therefore, we just need to use memoryview:
memoryview(arr_buffer)[:] = new_data
The memoryview approach is really good, but there is a little issue when we are dealing with uint8 RawArrays. The following code will cause an exception:
memoryview(arr_buffer_uint8)[:] = new_data_uint8
There is a solution for uint8 RawArrays using just memoryview and its cast method. However, numpy comes to the rescue and offers a simpler and more generic solution. You just need to convert the RawArray to a numpy representation in the following way:
arr_uint8_repr = np.ctypeslib.as_array(arr_buffer_uint8)
arr_uint8_repr[:] = new_data_uint8
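To make the numpy approach concrete, here is a small self-contained sketch. The frame dimensions and the fill value are assumptions made up for the illustration. It only shows that the numpy representation and the RawArray share the same memory, so a bulk write through the view lands directly in the shared block.

```python
from multiprocessing import RawArray

import numpy as np

# Hypothetical frame size, chosen only for this illustration.
height, width, channels = 4, 6, 3

# Shared block of unsigned bytes ("B" is the uint8 typecode).
frame_buffer = RawArray("B", height * width * channels)

# NumPy view over the same memory; no copy is made here.
frame_view = np.ctypeslib.as_array(frame_buffer)

# Stand-in for a rendered frame coming from the renderer.
new_frame = np.full((height, width, channels), 200, dtype=np.uint8)

# Bulk copy into shared memory through the NumPy view.
frame_view[:] = new_frame.ravel()

# Reading through the RawArray itself shows the written values.
first_byte = frame_buffer[0]
last_byte = frame_buffer[len(frame_buffer) - 1]
print(first_byte, last_byte)
```

Because the view is created once and reused, each update is a single vectorized copy rather than an element-by-element assignment.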
You can navigate to my repository at this specific commit and test the streaming examples to see how this little modification improves the performance.
Multiprocessing on different operating systems
Serge Koudoro, who is one of my mentors, pointed out an issue with the streaming system running on macOS. I don't know much about macOS, and as Filipi pointed out, the way macOS deals with multiprocessing is very different from the Linux approach. Although we solved the issue discovered by Serge, I need to be more careful about assuming that different operating systems will behave in the same way. If you want to know more, I recommend reading the post Python: Forking vs Spawn. It's also important to read the official Python documentation; it can save you a lot of time. Take a look at what the official Python documentation says about the multiprocessing start methods (source: https://docs.python.org/3/library/multiprocessing.html).
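A quick way to see which start methods your own machine offers is to ask the multiprocessing module directly. The snippet below is a minimal sketch using only standard-library calls; the variable names are mine. The default method depends on the operating system and the Python version, so the printed values will differ between machines.

```python
import multiprocessing as mp

# Start methods supported on the current platform, e.g. "spawn" or "fork".
available_methods = mp.get_all_start_methods()

# Method used by default on this platform. Calling this fixes the default
# context, so a later set_start_method() would need force=True.
default_method = mp.get_start_method()

# An explicit context pins the method for one part of the program only,
# without changing the global default. "spawn" exists on every platform.
spawn_ctx = mp.get_context("spawn")

print("available:", available_methods)
print("default:", default_method)
print("explicit context:", spawn_ctx.get_start_method())
```

Testing the streaming code with an explicit spawn context on Linux is one way to catch, ahead of time, problems that would otherwise only appear on a system where spawn is the default.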