This report summarizes the work done by me (Kishan Ved) for Google Summer of Code 2024 with the NumFOCUS organization, on the project Open Science Labs: PyDataStructs: Add a C++ Backend for tree data structures and their algorithms. Weekly reports are available here: GSoC BLogs
I am Kishan Ved, an undergraduate student at the Indian Institute of Technology Gandhinagar (IIT Gandhinagar), India, in the department of Computer Science and Engineering.
PyDataStructs aims to be a Python package for various data structures in computer science. It is under development involving addition of algorithms and their parallel implementations. To the best of our knowledge, PyDataStructs is the first well-designed library/package which has covered most of the data structures and algorithms.
Features:
-
A single package for all your data structures and algorithms - We have and are implementing many popular and useful data structures and algorithms.
-
Consistent and Clean Interface - The APIs we have provided are consistent with each other, clean and easy to use. We make sure of that before adding any new data structure or algorithm.
-
Well Tested - We thoroughly test our code before making any new addition to PyDataStructs. 99 percent lines of our code have already been tested by us.
So, you can easily rely on PyDataStructs for any data structure or algorithm you want to use without worrying about implementing it from scratch. Everything is just a few calls away.
Project link: Open Science Labs: PyDataStructs: Add a C++ Backend for tree data structures and their algorithms
My project involved adding a C++ backend for all tree data structures in PyDataStructs, a Python package for advanced data structures and algorithms. The user has an option to select either the Python backend or the C++ backend.
tree = RedBlackTree(backend=Backend.CPP)
For any data structure, the Python backend is developed first, and once completely tested and ready, its C++ backend is developed. Both the backends share full functionality and are completely compatible. The C++ backend is extremely fast, it executes codes 8-10 times faster. This enhances the computation speed, making it extremely valuable for scientific computing and high-performance applications.
I started working from the community bonding period itself, and this gave me a good headstart and allowed me to complete the project in 12 weeks. Here's an outline of all the work I did:
PR Description | Contribution |
---|---|
PR: Added Introsort algorithm | |
PR: Fixed version related bugs |
PR Description | Contribution |
---|---|
PR: C++ backend for Node, TreeNode, ArrayForTrees, BinaryTree and BinarySearchTree and all tree traversals implemented |
PR Description | Contribution |
---|---|
PR: C++ backend for Self Balancing Binary Tree | |
PR: C++ backend for Red Black Trees | |
PR: C++ backend for Binary Indexed Trees | |
PR: C++ backend for Splay Trees |
PR Description | Contribution |
---|---|
PR: C++ backend for AVL Trees | |
PR: C++ backend for Cartesian Trees | |
PR: C++ backend for Treap | |
PR: C++ backend for all trees in binary_trees.py file complete |
|
PR: Updated Documentation |
Lines added:
Commits made: 12
Total merged Pull Requests : 12
Here's a complete list of all my merged PRs
Click to see a benchmark code
To run this code, you need PyDataStructs installed with the following imports:
import timeit, functools, os, pytest
from pydatastructs.trees.binary_trees import (BinarySearchTree, RedBlackTree)
from pydatastructs.utils.misc_util import Backend
def test_BinarySearchTree(**kwargs):
cpp = Backend.CPP
repeat = 1
number = 1
size = int(os.environ.get("PYDATASTRUCTS_BENCHMARK_SIZE", "1000"))
size = kwargs.get("size", size)
BST = BinarySearchTree
b1 = BST(backend=Backend.PYTHON)
b2 = BST(backend=Backend.CPP)
def f(backend, tree):
for node in range(-1000,1000):
tree.insert(node, node)
def g(backend, tree):
for node in range(-1000, 1000):
tree.search(node)
def h(backend, tree):
for node in range(2000):
tree.delete(node)
kwds_dict_PY = {"backend": Backend.PYTHON, "tree":b1}
kwds_dict_CPP = {"backend": Backend.CPP, "tree":b2}
timer_python = timeit.Timer(functools.partial(f, **kwds_dict_PY))
python_insert = min(timer_python.repeat(repeat, number))
timer_cpp = timeit.Timer(functools.partial(f, **kwds_dict_CPP))
cpp_insert = min(timer_cpp.repeat(repeat, number))
assert cpp_insert < python_insert
timer_python = timeit.Timer(functools.partial(g, **kwds_dict_PY))
python_search = min(timer_python.repeat(repeat, number))
timer_cpp = timeit.Timer(functools.partial(g, **kwds_dict_CPP))
cpp_search = min(timer_cpp.repeat(repeat, number))
assert cpp_search < python_search
timer_python = timeit.Timer(functools.partial(h, **kwds_dict_PY))
python_delete = min(timer_python.repeat(repeat, number))
timer_cpp = timeit.Timer(functools.partial(h, **kwds_dict_CPP))
cpp_delete = min(timer_cpp.repeat(repeat, number))
assert cpp_delete < python_delete
print("Python Time:")
print("insert(): ",python_insert,"s")
print("search(): ",python_search,"s")
print("delete(): ",python_delete,"s")
python_total = python_insert+python_search+python_delete
print("Total Python time: ", python_total,"s\n")
print("C++ Time:")
print("insert(): ",cpp_insert,"s")
print("search(): ",cpp_search,"s")
print("delete(): ",cpp_delete,"s")
cpp_total = cpp_insert+cpp_search+cpp_delete
print("Total C++ time: ", cpp_total,"s\n")
print("C++ backend is",round(python_total/cpp_total),"times faster!")
test_BinarySearchTree()
Time taken for methods of Binary Search Tree class to execute in different backends:
The picture clearly indicates the utility of the C++ backend. It makes code execution much faster. This is helpful for high-performance computing.
My Google Summer of Code blogs are available on my website: https://kishan-ved.github.io/portfolio/blog/
Read about my open source experience, right from my very first open source contribution to getting selected for GSoC and completing my Google Summer of Code project!
My Google Summer of Code Journey
My project is complete, the C++ backend for all trees is fully functional. Some (non-critical) issues have been opened, these need to be addressed. For upcoming plans (and major goals), refer PyDataStructs Wiki on GitHub.
GSoC has significantly accelerated my learning curve and taught me to write clean, robust and production ready code for solving complex, real-world problems. Here are some of my key learnings from GSoC:
Tech: Mastered the art of linking a Python code to a C++ backend by using the Python-C API to improve speeds greatly. Polished my C++ and Python coding skills.
GitHub: Learned various new commands, resolution of conflicts and merging branches for collaborative work.
Perseverance: GSoC taught me to read the documentation, be calm and perseverant. It's difficult at the start but smoother ahead!
Thanks to my mentor Gagandeep Singh for his support and guidance. Thanks to Ivan Ogasawara and the team at Open Science Labs and NumFOCUS.