thiébaud.fr has recently pointed out that the robots.txt file of the US Department of State website contains a grave misconfiguration. A robots.txt file is part of a web server's configuration and is meant to be publicly readable. It tells search engines which links on the site should (or should not) be indexed.
But the configuration directives used in this robots.txt appear to have been mixed up with those of another file called ".htaccess", which is not visible to users and is private to the web server. An .htaccess file, among other things, tells the web server which files it may serve to users. Because robots.txt actually contained the instructions that were meant to go into .htaccess, it exposed to search engines a list of files and paths that .htaccess should have protected instead.
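A mix-up like this is easy to spot mechanically. The following is a minimal sketch of such a check, assuming that only the commonly used robots.txt fields (User-agent, Disallow, Allow, Sitemap, Crawl-delay) count as legitimate; that field list and the function name `find_foreign_lines` are assumptions for illustration, and nothing here reproduces the actual contents of the State Department file.

```shell
#!/bin/bash
# Hypothetical helper: print (with line numbers) every line of a
# robots.txt-style file that is neither blank, a comment, nor one of the
# commonly used robots.txt fields. The field list is an assumption and
# not an exhaustive standard.
find_foreign_lines() {
    grep -n -v -i -E \
        '^[[:space:]]*($|#|(user-agent|disallow|allow|sitemap|crawl-delay)[[:space:]]*:)' \
        "$1"
}
```

Running `find_foreign_lines robots.txt` on a healthy file prints nothing. Any output, for instance Apache-style access rules, hints that directives from another configuration file ended up in the wrong place.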
The current version of the robots.txt still contains a list of paths to around 9577 (!) documents, many of which are sensitive, whether unclassified, classified or confidential. Among others, these now publicly available files seem to include:
- government contracts and agreements between the DoD and foreign countries,
- information on US military bases on foreign soil,
- meeting minutes between diplomats (some in a very casual tone),
- signed memoranda of understanding (MoU),
- defense agreements for military operations, and strategy on the war on drugs.
The original PDF files have been removed from the US DoS server, presumably during a clean-up following the Snowden leaks.
Most of the files listed in robots.txt can nevertheless still be downloaded by the public from archive.org, thanks to 6 snapshots that were taken in 2012/2013. Presumably this is because many spiders crawling the web try to be as tolerant as possible when they encounter broken syntax in robots.txt, and so they archived the documents, which are now out in the open.
This (raggedy) script fetches the documents from archive.org to your local hard drive (using Tor for this is highly recommended).
    #!/bin/bash
    # Wayback Machine snapshot timestamps to try, in order.
    snapshots="20120713050942 20121013154343 20121010165822 20120921054221 20130413152313 20130113162428"

    # Original source was http://state.gov/robots.txt; a copy is also on pastebin.
    wget --output-document=robots.txt "http://pastebin.com/raw.php?i=RE2tpyR3"

    for x in $snapshots; do
        # Take the second space-separated field of each line and strip
        # carriage returns (\15) and Ctrl-Z (\32) characters.
        for i in $(cut -d ' ' -f2 ./robots.txt | tr -d '\15\32'); do
            if [ -e "$(basename "$i")" ]; then
                echo "$i already fetched"
            else
                wget "https://web.archive.org/web/$x/http://www.state.gov/documents/$i"
            fi
        done
    done
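Before downloading anything, it can be useful to see which URLs the script would request. The following is a small dry-run sketch under the same assumptions as the script above (the path sits in the second space-separated field of each line and is appended to http://www.state.gov/documents/). The function name `list_archive_urls` is hypothetical.

```shell
#!/bin/bash
# Hypothetical dry-run helper: print the archive.org URL for every path in a
# robots.txt-style file and every given snapshot timestamp, without fetching.
# Usage: list_archive_urls FILE SNAPSHOT [SNAPSHOT...]
list_archive_urls() {
    local file=$1 snapshot path
    shift
    for snapshot in "$@"; do
        # Same extraction as the download script: second space-separated
        # field, with carriage returns and Ctrl-Z characters stripped.
        cut -d ' ' -f2 "$file" | tr -d '\15\32' | while IFS= read -r path; do
            # Skip blank lines so no URL with an empty path is printed.
            [ -n "$path" ] || continue
            printf 'https://web.archive.org/web/%s/http://www.state.gov/documents/%s\n' \
                "$snapshot" "$path"
        done
    done
}
```

For example, `list_archive_urls robots.txt 20120713050942 | wc -l` gives the number of requests a single snapshot pass would make, and the output can be reviewed or filtered before any download starts.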
The wider question remains: what were these files doing on a web server open to the public? Even if the files had not been exposed, the way the .htaccess was configured (1 entry per file) suggests that a manual process was used to protect the content. From a security perspective this breaks every rule in the book. These files should never have been stored there in the first place. Those of you eagerly subscribing to conspiracy theories might even wonder if the DoS is running a honeypot for the NSA 🙂