We have a build system that has grown organically. It started as a shell script. We needed to run it from GitLab, so we wrote helper scripts to insulate our code from GitLab. Then we added some helper functions to mimic the GitLab interactions when working from the command line. The helper functions grew until you could not practically run the original shell script without them.
It is a mess.
I want to refactor it.
Refactoring Shell is painful.
I want objects. I want Python.
So I am rewriting the GitLab and helper-function layers in Python, with an eye to rewriting the whole thing. Here's what I have learned:
Approach
While it is possible to call a bash script from Python, it is not possible to source a bash file and then call the functions it defines. Because the original code makes heavy use of bash functions, I cannot do a function-by-function port to Python. Instead, I have to start either at the top and work down, or at the bottom and work up.
What do I mean? By the top, I mean the interface where you call the code. We have a set of very short scripts that are designed to be called from GitLab. For this example, I have a script called pipeline_kernel_next.sh. I can wrap it with a Python script that at first just calls pipeline_kernel_next.sh, and make sure everything runs. Then I start duplicating the code from pipeline_kernel_next.sh in Python, commenting out the calls to the shell functions as I go.
By the bottom I mean replacing a function with a call to a Python script. I could remove a function from the bash code and instead have bash call an external script, which would be written in Python. I avoided doing that because I did not want to get deep into the bash code. I might follow this approach for some things in the future.
So far I have only done the top-down approach. I started by writing a script that just calls the existing script, and made sure that ran. Once I got that far, I started disabling pieces of the main script and re-implementing them in Python, one at a time. This mostly could be done without modifying the original scripts; instead, each pass only implemented a subset of their functionality.
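A minimal sketch of that first wrapper stage, assuming pipeline_kernel_next.sh sits next to the Python file and takes no arguments:

import subprocess
import sys
from pathlib import Path

# Stage one: the wrapper does nothing except run the existing shell script,
# so the pipeline keeps working while pieces are ported one at a time.
script = Path(__file__).parent / "pipeline_kernel_next.sh"
result = subprocess.run(["bash", str(script)])
sys.exit(result.returncode)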
Subprocess
The standard Python library has gone through a few iterations of how to run another process. The current approach is the subprocess module. While it is a very full implementation, it is not always intuitive to use.
If I wanted to run the bash command:
git reset --hard
I would have to execute the following code.
subprocess.run(["git", "reset", "--hard"]) |
Which is pretty simple. However, if I want to capture the output from that command and also echo it to the console, it gets more complex. Add in the need for pipes and other bash-isms and you end up writing a fair bit of boilerplate code for each invocation. Other people have gone through this and come up with wrappers that make it simpler.
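For example, a sketch of the capture-and-echo boilerplate, assuming we only care about stdout:

import subprocess

# Run the command, keep the output, and echo it to the console as well.
proc = subprocess.run(
    ["git", "reset", "--hard"],
    capture_output=True,
    text=True,
    check=True,
)
for line in proc.stdout.splitlines():
    print(line)
output = proc.stdout  # keep a copy for later use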
Plumbum
Ricardo, a fellow Python-Over-Coffee participant, suggested I take a look at the plumbum library. It has been a great starting point. With it, I can convert the majority of the shell scripts to Python line for line. It is syntactic sugar for the mechanisms Python already has for working with other processes: you could use those directly, but you would end up writing a lot of boilerplate code around each line.
With plumbum, a section that looked like this in shell:
git config rerere.enabled true
git rebase --quit >/dev/null 2>&1
git reset --hard >/dev/null 2>&1
git clean -f -x -d >/dev/null 2>&1
git remote update
$ARGH pull
can be converted to this in Python:
import os
from plumbum import local

argh = local[os.getcwd() + '/argh']
git = local['git']

git("config", "rerere.enabled", "true")
git("rebase", "--quit")
git("reset", "--hard")
git("clean", "-f", "-x", "-d")
git("remote", "update")
argh("pull")
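plumbum also hands back a command's output directly and has its own sugar for pipes. A small sketch, where the grep pattern is just an illustration:

from plumbum import local

git = local['git']
grep = local['grep']

# Calling a bound command returns its stdout as a string.
branches = git("branch", "--list")

# Commands can be chained with | much like a shell pipeline.
merges = (git["log", "--oneline"] | grep["Merge"])()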
Directories
I am trying to minimize the structural changes from shell to Python to make it easier for my teammates to read the code during the transition. However, one issue I cannot work around inline is the building of directory paths: our naming scheme is complex enough that I keep getting things wrong. I have collected all of the directories into a single object that gets allocated when the Python file is first run. Yes, it is a singleton approach, and I don't foresee it staying like this permanently, but it works to get me to Python. The directories are cached properties, which means they are lazy-loaded and built on demand. An example:
import os
from functools import cached_property

top_dir = os.getcwd()

class Directories:
    @cached_property
    def stage(self):
        dirname = top_dir + "/stage"
        os.makedirs(dirname, exist_ok=True)
        return dirname

    @cached_property
    def repo(self):
        return top_dir + "/linux"

dirs = Directories()
Now the repo directory can be referenced as:
dirs.repo
With
One thing that has tripped me up in the shell code base is that we are constantly changing directories: we have to be in the parent directory to run git clone, but inside the repo directory to run any other git commands. If we then forget to change back, we are running commands in the wrong directory.
Plumbum lets me isolate the commands that should run in a subdirectory using the Python with statement. The code above then gets indented and looks more like this:
with local.cwd(dirs.repo):
    git("config", "rerere.enabled", "true")
    git("rebase", "--quit")
    git("reset", "--hard")
    git("clean", "-f", "-x", "-d")
    git("remote", "update")
    argh("pull")
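The context manager also restores the previous working directory when the block exits, which removes the whole forgot-to-change-back class of bug. A sketch of the clone-then-work pattern, where the repository URL is a placeholder:

from plumbum import local

git = local['git']

# Clone from the parent directory; the URL is made up for the example.
with local.cwd(top_dir):
    git("clone", "ssh://git@example.com/kernel/linux.git", dirs.repo)

# Work inside the repo; on exit the working directory reverts automatically.
with local.cwd(dirs.repo):
    git("remote", "update")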
- clone the git repos.
- kick off the build script
- move the produced files into the right location
- change the names of the files to match our convention
- upload the files to the remote server via scp (the last few steps are sketched below)
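A rough sketch of how those last few steps might look with plumbum; the artifact path, naming convention, and destination host are all hypothetical:

import shutil
from plumbum import local

scp = local['scp']

# Hypothetical example: move a build artifact into the staging directory,
# rename it to match a made-up convention, then copy it to a made-up host.
built = dirs.repo + "/arch/x86/boot/bzImage"   # hypothetical build output
renamed = dirs.stage + "/kernel-next.img"      # hypothetical naming convention
shutil.move(built, renamed)
scp(renamed, "builder@example.com:/srv/kernels/")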
Insulation
I found it essential in doing this work to insulate my development from production. This may seem obvious, but in git-ops work you often find you are working in and on the production pipeline, because otherwise you don't know that things work end to end. When all you are doing is checking out from the main git repo, it is fine to use the production repo. When you are building new trees, tagging them, and pushing those trees to a remote repo, you want to work with your own remote repo, not production. However, much dev-git-ops code is written with the remote repos hard-coded in, so you need to make that overridable, which in my case means I do have to modify the shell scripts at least that far.
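One simple way to make a hard-coded remote overridable; the environment variable, default URL, and branch name here are all placeholders:

import os
from plumbum import local

git = local['git']

# Fall back to the production remote unless KERNEL_REMOTE points somewhere
# else, such as a personal fork used for testing.
remote_url = os.environ.get("KERNEL_REMOTE", "ssh://git@example.com/kernel/linux.git")
git("push", remote_url, "next-build")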
Edit: a big thanks to the members of the Boston Python Users Group Slack who provided feedback and editing for this article.