Apache provides a pretty standard screen to display directory contents if you do not customize it. We post artifacts up to a local server that I later need to download. Here are my hacky notes using command line utilities. I probably will convert this to Python next.
If you download a default page using curl you get something like this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head>
<title>Index of /gitlab-ci/ftk</title>
</head>
<body>
<h1>Index of /gitlab-ci/ftk</h1>
<pre><img src="/icons/blank.gif" alt="Icon "> <a href="?C=N;O=D">Name</a> <a href="?C=M;O=A">Last modified</a> <a href="?C=S;O=A">Size</a> <a href="?C=D;O=A">Description</a><hr><img src="/icons/back.gif" alt="[PARENTDIR]"> <a href="/gitlab-ci/">Parent Directory</a> -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="202507170316/">202507170316/</a> 2025-07-18 15:10 -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="202507182212/">202507182212/</a> 2025-07-18 16:22 -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="202507211853/">202507211853/</a> 2025-07-21 13:16 -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="202507212059/">202507212059/</a> 2025-07-21 14:56 -
<hr></pre>
</body></html>
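The entries are sorted by name, and since the names are timestamps that also means they are sorted by time, so we can grab the first one on the list. One catch: the default order is ascending, so the first entry is the oldest. I want the latest, and adding ?C=N;O=D to the URL (the same link the Name column header uses) flips the listing so the latest comes first. First, download it to a file so we don’t have to wait for a download each iteration.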
curl https://yourhostname.net/fileserver/ > ~/file_list.html
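Because the directory names are fixed-width timestamps, the newest one can also be picked by comparing names, which does not depend on how the page happens to be sorted. Here is a minimal sketch of that idea using the standard library's html.parser rather than XPath. The names LinkCollector and latest_build_dir are made up for illustration, and the sample string is a trimmed-down version of the listing above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in the page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def latest_build_dir(listing_html):
    """Return the highest-numbered timestamp directory, or None if there is none."""
    collector = LinkCollector()
    collector.feed(listing_html)
    # Keep only links such as "202507212059/": all digits plus a trailing slash.
    # This skips the sort links ("?C=N;O=D") and the parent directory link.
    build_dirs = [h for h in collector.hrefs if h.endswith("/") and h[:-1].isdigit()]
    # Fixed-width timestamps sort chronologically as plain strings.
    return max(build_dirs, default=None)


# Trimmed-down sample, deliberately out of order to show that position does not matter.
sample = '''<pre><a href="?C=N;O=D">Name</a> <a href="/gitlab-ci/">Parent Directory</a> -
<a href="202507170316/">202507170316/</a> 2025-07-18 15:10 -
<a href="202507212059/">202507212059/</a> 2025-07-21 14:56 -
<a href="202507182212/">202507182212/</a> 2025-07-18 16:22 -
</pre>'''
print(latest_build_dir(sample))  # 202507212059/
```

This only works while every directory name is an all-digit timestamp of the same width. If other naming schemes show up on the server, the filter would need to change.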
You can use XPath to navigate most of the HTML. At first, I kinda gave up and used awk, but I suspected I would get a cleaner solution with just XPath if I stayed on it. And I did:
xmllint --html --xpath "//html/body/pre/a[text()='Parent Directory']/following::a[1]/@href" ~/file_list.html
This gives me
href="202507170316/"
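Selecting the attribute node like this keeps the href and the quotes in the output. Wrapping the whole expression in string(...) makes xmllint print just the value, but either way I can deal with that.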
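If the raw attribute form is what you end up with, pulling the value out is a small regex job. This is just a sketch; href_value is a hypothetical helper name, and the leading space and trailing newline in the example input are an assumption about how the command output might arrive.

```python
import re


def href_value(xmllint_output):
    """Extract the value from a string like ' href="202507170316/"'."""
    # Capture everything between the double quotes that follow href=.
    match = re.search(r'href="([^"]*)"', xmllint_output)
    return match.group(1) if match else None


print(href_value(' href="202507170316/"\n'))  # 202507170316/
```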
Here is the same logic in Python:
#!/usr/bin/env python3
import requests
from lxml import etree
url = "https://yourserver.net/file/"
response = requests.get(url)
response.raise_for_status()  # stop early on a 4xx/5xx response
html_content = response.text
# Parse the HTML content using lxml
tree = etree.HTML(html_content)
# Use XPath to select elements
# Select the href of the first link after the "Parent Directory" link
val = tree.xpath("//html/body/pre/a[text()='Parent Directory']/following::a[1]/@href")[0]
print(val)
# To keep going down the tree...
# Plain concatenation works here because url ends with "/"
release_url = url + val
response = requests.get(release_url)
html_content = response.text
print(html_content)
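One thing to watch when going down the tree: gluing the base URL and the href together with + only works if the base ends with a slash. A slightly more forgiving sketch uses urljoin from the standard library. child_url is a made-up helper name, and the URLs are the same placeholder host used above.

```python
from urllib.parse import urljoin


def child_url(base_url, href):
    """Join a listing URL and an href taken from that listing."""
    # Without a trailing slash, urljoin would replace the last path segment
    # instead of appending to it, so add one if it is missing.
    if not base_url.endswith("/"):
        base_url += "/"
    return urljoin(base_url, href)


print(child_url("https://yourserver.net/file", "202507170316/"))
print(child_url("https://yourserver.net/file/", "202507170316/"))
# An absolute href such as the parent directory link replaces the whole path.
print(child_url("https://yourserver.net/file/", "/gitlab-ci/"))
```

The first two calls give the same URL, which is the point: the result no longer depends on whether the trailing slash was typed.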