Scikit Build Proposal

I’ve spent the last few years trying to make it easy for anyone to extend Python with compiled languages. I’ve worked on pybind11, a powerful C++ library that allows users to write advanced Python extensions using just C++11; it is used by some of the largest projects and organizations (SciPy, PyTorch, Google, LLVM) and by tens of thousands of other libraries, down to very small extensions. I also work on cibuildwheel, which makes building binaries (called wheels) on continuous integration (CI) simple. It is again powerful enough to be used by huge projects, like Scikit-learn, matplotlib, and mypy, and simple enough to be used by hundreds of other packages. Recently it was accepted into the Python Packaging Authority (PyPA). There is one missing piece, though, to complete this picture of compiled extensions that are easy to use for small projects and powerful enough for large projects: the build system. I believe the solution to that is scikit-build. I’d like to work on it over the next three years, and I need your help.

Scikit-build is a tool for integrating a package with a CMake build system into Python. You can utilize the vast collection of packages and projects using CMake already, and you have access to modern building features, like multithreaded builds, library discovery, superb compiler and IDE support, and all sorts of extended tooling. Modern CMake is quite pleasant to write compared to times past; I have written a book and training course on it. We ship up-to-date cmake and ninja wheels for all binary platforms.

I’m writing a proposal for an NSF CSSI Elements project containing three parts. The first part will cover core development on Scikit-build to address the current shortcomings and to prepare it for a post-distutils (Python 3.12+) world. The second part would cover assisting libraries with a science use case in transitioning to scikit-build, ideally from an existing CMake build system with Python bindings. I can also help mentor developers in writing bindings (ideally pybind11), setting up CI, and writing CMake code (see my book or workshop on Modern CMake), and I’m happy to help old scikit-build projects transition to better practices. As part of this, I would be building up the examples and documentation, leading into the third part of the proposal: a series of training events and training material, including plans for something alongside SciPy.

Do you have an interesting use case? I’d love to hear about it. Feel free to reach out to me on twitter @henryschreiner3 or by email, henryfs at princeton.edu. I’m looking for projects that would be interested in trying out the new scikit-build during years two and three. Tangibly, I’m looking for collaboration letters and a 1-3 sentence description of how we will collaborate over the course of the three-year project - I can help you try out scikit-build over the next three years! After the break, I will give a few more details about the problem and outline my specific plans in a bit more detail. You can also see an outline at scikit-build/scikit-build/wiki or at the end of this post. The deadline is Dec 8, 2021, so letters need to be completed by the 7th or sooner!


Intro to build systems

Build systems in Python are hard, especially for compiled extensions. So much so that the official method for compiling extensions, distutils, is deprecated and is slated to be removed in Python 3.12 in 2023 (PEP 632). The third-party copy of this, setuptools, is only slightly better, with minor patches for broken functionality, but using this system to compile anything is difficult and error prone. NumPy has over 13,000 lines of code dedicated to building with distutils. This build system is purely Python, requiring any existing project to completely reinvent its build for distutils/setuptools, as well as implement library searches, most aspects of compiler support, caching, and multithreaded building itself.

The most popular C/C++/Fortran/CUDA build system is CMake by Kitware, which is either used or at least supported by almost every C++ project, editor, compiler stack, debugger, or other utility. This is an ideal candidate, and an important one for scientific work: it is the most widely used portable build system in scientific computing, and supporting it enables access to a vast treasure trove of existing libraries, either for direct wrapping, or as a library in a larger project. Scikit-build also provides PyPI hosted binaries (wheels), making cmake and ninja just a pip install away on most platforms; using cibuildwheel, I’ve helped expand this to every system covered by wheels, including Apple Silicon, Linux Arm, PowerPC, and musllinux.

Scikit-build was developed by Kitware in 2016 (as PyCMake) to connect CMake to the Python packaging process. It was featured at the SciPy conference several times. Scikit-build has already shown great promise and initial adoption. A search over the most popular GitHub libraries shows several hundred packages are using Scikit-build. One of the most recent examples is clang-format-wheel, which uses Scikit-build to leverage the existing LLVM CMake build system to build wheels using freely available GitHub Actions. These wheels are under 2 MB in size, cover the last four LLVM versions, and are easily installed on all common platforms. The total code required was under a page of CMake, under a page of Python, and about a page of CI code. Other examples include pyjion (a project by a CPython core developer working on a pluggable JIT interface to CPython), xeus-python (an interpreted C++ kernel for Jupyter), opencv-python, Parselmouth, Ogre3D (open source graphics rendering engine), deepmd (molecular dynamics deep learning toolkit), xdoctest, DUNE (partial differential equations grid solver), goofit (High Energy Physics GPU fitter), and many more.

Plan of work

Part 1: core work

Like other build related packages of that era, scikit-build was (and still is) implemented on top of the distutils/setuptools functions. This was all that was available then; there was no public API for implementing a package builder. This shortcoming was addressed in PEP 517 by standardizing the interface for package builders, and in PEP 518 which allowed users to specify the exact build requirements for their package.

The scikit-build package needs to be rebuilt without usage of distutils. Building scikit-build-core without a reliance on the classic build system will enable it to adopt two key enhancements in modern Python packaging: the ability to be run directly by the builder (PEP 517), and the ability to be configured using the universal project configuration language (PEP 621). This will enable scikit-build-core to be completely free from legacy setup.py code, reducing the number of things a new user has to learn and the number of files they need to write. The currently partially broken caching system will also be overhauled to take advantage of the new editable support (PEP 660) that was just included in the latest pip (Python’s package installer) release, 21.3. The new scikit-build-core will also have a user accessible API, allowing integration as a plugin into other build systems already using PEP 517, such as Poetry and Trampolim.
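To make the PEP 517 idea concrete, here is a rough sketch of what a CMake-driving backend module could look like. This is not scikit-build's code: the helper name cmake_commands, its parameters, and the choice of the Ninja generator are assumptions for illustration. Only the hook names and signatures (get_requires_for_build_wheel, build_wheel) come from PEP 517, and the cmake flags used are standard CMake command-line options.

```python
"""Hypothetical PEP 517 backend sketch; not the actual scikit-build-core design."""
import os


def cmake_commands(source_dir, build_dir, install_dir, build_type="Release", jobs=None):
    """Return the configure, build, and install command lines (hypothetical helper)."""
    jobs = jobs or os.cpu_count() or 1
    configure = [
        "cmake", "-S", str(source_dir), "-B", str(build_dir),
        "-G", "Ninja",  # assumption: Ninja is available (e.g. from the ninja wheel)
        f"-DCMAKE_BUILD_TYPE={build_type}",
        f"-DCMAKE_INSTALL_PREFIX={install_dir}",
    ]
    build = ["cmake", "--build", str(build_dir), "--parallel", str(jobs)]
    install = ["cmake", "--install", str(build_dir)]
    return [configure, build, install]


def get_requires_for_build_wheel(config_settings=None):
    """PEP 517 hook: extra packages the frontend should install before building."""
    return ["cmake", "ninja"]


def build_wheel(wheel_directory, config_settings=None, metadata_directory=None):
    """PEP 517 hook: a real backend would run the commands above, pack the
    install tree into a .whl inside wheel_directory, and return its filename."""
    raise NotImplementedError("packing the wheel is omitted from this sketch")
```

A frontend such as pip would import this module by the name given in the build-system table of pyproject.toml and call the hooks directly, which is why no setup.py is needed.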

The next priority will be providing better support and integration with CMake. The current integration is based on Python CMake modules that have been deprecated for years. This will replace or update the existing module with support for the modern CMake FindPython module; the author previously has demonstrated that dual-supporting both Python discovery systems is possible with the popular pybind11 library. There are also several problematic choices with the current modules causing issues to be raised; these will be addressed and replaced.

To enable working in the broader ecosystem of scientific tools in a Python package manager, we will bring CMake’s excellent package search capabilities to Python modules. An extension discovery system will allow other packages to define scikit-build specific entry points to broadcast that they provide CMake config files. This will enable an ecosystem of packages that provide reusable CMake interfaces inside normal Python packages. This will be initially integrated with the pybind11 library.
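As a sketch of how such a discovery system might work, the snippet below uses the standard library's importlib.metadata to look up entry points in a group and turn them into a CMAKE_PREFIX_PATH argument. The group name "cmake.prefix" and both function names are assumptions made for illustration; the proposal does not fix these details.

```python
"""Hypothetical entry-point based discovery of CMake config directories."""
from importlib.metadata import entry_points
from importlib.util import find_spec

GROUP = "cmake.prefix"  # assumption: the group name is illustrative only


def cmake_prefix_dirs(group=GROUP):
    """Return directories of installed packages advertising CMake config files."""
    eps = entry_points()
    # Python 3.10+ offers .select(); older versions return a plain dict.
    selected = eps.select(group=group) if hasattr(eps, "select") else eps.get(group, [])
    dirs = []
    for ep in selected:
        try:
            # Each entry point value is assumed to name an importable package.
            spec = find_spec(ep.value)
        except (ImportError, ValueError):
            continue
        if spec is not None and spec.submodule_search_locations:
            dirs.extend(str(p) for p in spec.submodule_search_locations)
    return dirs


def prefix_path_argument(dirs):
    """Format directories as one CMake cache argument (semicolon-separated list)."""
    return "-DCMAKE_PREFIX_PATH=" + ";".join(dirs)
```

A provider package would declare an entry point in that group pointing at the package holding its Config files, and the build backend would append the resulting argument to its CMake configure command.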

Edit: We will be working with Kitware directly to integrate this functionality into CMake for all users; you should be able to access Python package configurations even without scikit-build driving the configuration!

Part 2: Adoption assistance

I would like to work with half a dozen or so projects with science use cases that can adopt Scikit-build. An ideal project would be one with an existing CMake build, and either existing Python bindings, or no Python bindings expected (just using wheels for distribution, like clang-format). But I am a core developer on pybind11, so I would be willing to help a project get started with pybind11 bindings. I’m also interested in helping existing scikit-build projects use the latest tooling, as well as helping ensure the old techniques do not regress.

For smaller projects, I’d still be happy to know you’d be interested in being involved, and would help run downstream testing / CI.

Part 3: Training

To improve the experience for beginners, the documentation will need to be improved and integrated with the existing examples. A template to produce a scikit-build package (similar to scikit-hep/cookie) will be provided, allowing a new project to be started in less than 60 seconds. The use of cibuildwheel, the official tool for building redistributable binaries for Python, will be featured prominently. Several new types of builds, such as for extensions that do not actually extend Python (like the clang-format example above), CUDA libraries, and library discovery, will be added, as well.

There will be a strong focus on training; during the final portion of the project, multiple workshops will be held to train participants in the building of binary extensions for Python. These will cover all aspects of building a project with a strong focus on best practices. Participants will build a pybind11 extension that can be built and then deployed instantly on any common operating system using the infrastructure developed above.

Alternatives

There are not many other ways to do this, but let’s quickly mention them, and why I believe scikit-build is the best.

Setuptools: The classic method of building. I think I’ve covered the problems above: it is very limited and is not being improved. It can only support very basic builds, and even then, is very tricky to set up. Supporting more than that (like NumPy does) requires thousands of lines of code.

Enscons: This is the only existing PEP 517 builder with compilation support at the time of writing. Unfortunately, this is based on SCons, which was abandoned by most projects years ago for a variety of issues. It was built on SCons because SCons is pure Python.

Meson: This is the tool selected by SciPy, and they are working on making it PEP 517 compatible and capable of building SciPy. The biggest problem with it is that it has far less support than CMake; most software useful for science has a CMake build system, not a Meson one. Most IDEs don’t support it, most tools don’t support it, etc. Though it requires Python, it still has a custom DSL for configuration.

It is more opinionated than CMake, and doesn’t really seem to solve anything other than simply being newer than CMake, so it has fewer bad examples and historical baggage (for now). It doesn’t support functions or macros, since it thinks that will make your configuration too hard to read - for a project like ATLAS, with 2,000 modules and millions of lines of C++ and Python, I don’t think this is remotely feasible in Meson. I understand the idea, but in practice there are times that building code is not pure configuration, and giving users a choice may be better than forcing them to conform. I have been told it has made a few conceptual mistakes related to Python that are likely to be hard to fix.

But mostly, you lose access to the huge collection of scientific packages that already support CMake. I believe all reasons SciPy chose Meson over scikit-build are addressed in this proposal. They also are not worried about supporting other previously existing codebases.

I do expect this to potentially become another great option, however. Meson seems to be replacing autotools more than CMake at the moment. Anything that reduces usage of autotools is good. We will be collaborating with SciPy during the proposal to share common designs and solutions.

Edit: we will be collaborating with the Meson SciPy project to come up with shared solutions.

Bazel: Google’s tool is like many Google tools; mostly designed for Google. It is unlikely that it would be able to be adapted properly for Python any time soon, and doesn’t provide nice Python distribution, like CMake does. Most usage of Bazel is tied to Google in some way. It does have more existing package support than Meson, at least, due to the weight of Google.

Summary

Here is a condensed outline of key features planned for part 1.

  • Classic interface (scikit-build)
    • Work on trying to fix caching or documenting deficiencies/workarounds
    • Avoid usage of distutils (deprecated and to be removed in Python 3.12)
    • Some static typing (limited by setuptools)
    • Replace backend with scikit-build-core (below)
    • Support setup.cfg configuration too
  • New interface: scikit-build-core
    • Public API for manipulating CMake
    • Includes a PEP 517 backend, configured via PEP 621 and a tool section
    • Can be used as a basis for a Poetry plugin or Trampolim file
    • Standard scikit-build will be rebased on this.
  • CMake helper files
    • Update for newer CMake features, like dual-support for FindPython.
    • Work on better Fortran support
  • General plans
    • Module discovery system: a module can contain CMake Config files (pybind11, to be joined by NumPy and SciPy)
    • Improve the test system
    • Support PEP 660 editable installs, improve semi-broken caching

Part 2 will work closely with several projects to adapt to or work with scikit-build. This will depend on what projects are interested in working with me on using scikit-build. This will also likely help improve scikit-build by exposing any missing functionality for special cases.

Part 3 will focus on outreach and training.

  • Add a cookiecutter to make it easy to set up a new project combining Python and C++
  • Work on expanded documentation, updating and writing tutorials
  • A new website, scikit-build.org
  • Provide examples for common situations, like a ctypes extension in Scikit-build.
  • Several workshops
  • Collaborations with US-RSE
  • Links and examples on websites like us-rse.org, numpy.org (already added), and cmake.org.