This script stack comprises two processes running concurrently: one on the URL Extractor server downloading, processing, and zipping the results, and the other pulling the finished files via rsync to a destination of your choice.
The stack can be run without the second process, but the server will quickly fill up if large data sets are being processed (e.g., CommonCrawl WARC/WAT files).
I have developed a PowerShell script that uses WinSCP over FTP. Running multiple instances of the script significantly speeds up the data transfer rate, and I strongly recommend this method for downloading files from the server.
<li>The script's current configuration runs 7 download tasks concurrently, with 1 connection per file. This is the best configuration for downloading from the CommonCrawl endpoint without being temporarily blocked with HTTP 403 responses.</li>
<li>A text file named "urls_to_download.txt" containing all of the CommonCrawl links must be located in /opt/.</li>
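For reference, each line of urls_to_download.txt would be one fully qualified CommonCrawl download URL. The entries below are an illustrative sketch only: the segment and file names are placeholders, not real crawl paths.

```
# one URL per line; <segment-id> and <file-name> are placeholders
https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/<segment-id>/warc/<file-name>.warc.gz
https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/<segment-id>/wat/<file-name>.warc.wat.gz
```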