Find files to download from Apache default pages

Apache provides a pretty standard page to display directory contents if you don't configure anything custom. We post artifacts to a local server, and I later need to download them. Here are my hacky notes using command-line utilities; I will probably convert this to Python next.

If you download a default page using curl you get something like this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>Index of /gitlab-ci/ftk</title>
 </head>
 <body>
<h1>Index of /gitlab-ci/ftk</h1>
<pre><img src="/icons/blank.gif" alt="Icon "> <a href="?C=N;O=D">Name</a>                    <a href="?C=M;O=A">Last modified</a>      <a href="?C=S;O=A">Size</a>  <a href="?C=D;O=A">Description</a><hr><img src="/icons/back.gif" alt="[PARENTDIR]"> <a href="/gitlab-ci/">Parent Directory</a>                             -   
<img src="/icons/folder.gif" alt="[DIR]"> <a href="202507170316/">202507170316/</a>           2025-07-18 15:10    -   
<img src="/icons/folder.gif" alt="[DIR]"> <a href="202507182212/">202507182212/</a>           2025-07-18 16:22    -   
<img src="/icons/folder.gif" alt="[DIR]"> <a href="202507211853/">202507211853/</a>           2025-07-21 13:16    -   
<img src="/icons/folder.gif" alt="[DIR]"> <a href="202507212059/">202507212059/</a>           2025-07-21 14:56    -  
<hr></pre>
</body></html>

Since they are sorted by time, we can grab the first one on the list (assuming we want the latest, which I do). First, download the page to a file so we don't have to wait for a fresh download on every iteration.

curl https://yourhostname.net/fileserver/ > ~/file_list.html

You can use XPath to navigate most of the HTML. At first, I kind of gave up and used awk, but I suspected I would get a cleaner solution with just XPath if I stayed on it. And I did:

xmllint --html  --xpath "//html/body/pre/a[text()='Parent Directory']/following::a[1]/@href" ~/file_list.html

This gives me:

href="202507170316/"

I don’t think XPath will remove the quotes or the href= prefix, but I can deal with that.
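If I want just the bare value, wrapping the expression in string() should do it (assuming your xmllint supports XPath string functions, which recent builds do):

xmllint --html --xpath "string(//html/body/pre/a[text()='Parent Directory']/following::a[1]/@href)" ~/file_list.html

That should print just 202507170316/ with no quotes or attribute name.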

Here is the same logic in Python:

#!/usr/bin/env python3

import requests
from lxml import etree

url = "https://yourserver.net/file/"
response = requests.get(url)
html_content = response.text

# Parse the HTML content using lxml
tree = etree.HTML(html_content)

# Use XPath to select elements
# Grab the href attribute of the first link after the 'Parent Directory' entry
val = tree.xpath("//html/body/pre/a[text()='Parent Directory']/following::a[1]/@href")[0]
print(val)
# To keep going down the tree...
release_url = url + val

response = requests.get(release_url)
html_content = response.text
print(html_content)
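From here the next step is to repeat the trick one level down: parse the release directory listing and pull an artifact out of it. I have not finished that part, but continuing from the script above, a rough sketch might look like this (the assumption that artifacts are plain files rather than subdirectories is mine; adjust for whatever your listing actually contains):

# Continuing from the script above: html_content now holds the release
# directory listing, so pull out its links the same way.
file_tree = etree.HTML(html_content)
hrefs = file_tree.xpath("//html/body/pre/a[text()='Parent Directory']/following::a/@href")

# Assumption: subdirectory hrefs end with a slash, plain files do not.
artifacts = [h for h in hrefs if not h.endswith("/")]
print(artifacts)

# Download the first artifact and save it under the same name.
artifact = artifacts[0]
with requests.get(release_url + artifact, stream=True) as resp:
    resp.raise_for_status()
    with open(artifact, "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)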
