Overview:

Use the following set of scripts to extract URLs from WARC/WAT files and write them to a zstd-compressed text (.txt.zst) file.
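
The heavy lifting is done by warc_wat_url_processor.py. Conceptually, the extraction step boils down to something like the rough sketch below (the file name and the grep-based approach are illustrative only, not the script's actual implementation):

    # Rough illustration only: pull WARC-Target-URI headers out of a
    # gzipped WARC/WAT file and write them to a zstd-compressed text file.
    zcat example.warc.wat.gz \
      | grep -aoP '^WARC-Target-URI: \K\S+' \
      | zstd -q > example_urls.txt.zst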

The stack consists of two processes running concurrently: one on the URL extractor server downloading, processing and compressing the results, and the other pulling the finished files via rsync to a destination of your choice.

The extraction process can be run without the transfer process, but the server's disk will quickly fill up if large data sets are being processed (e.g. the CommonCrawl WARC/WAT files).
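
A minimal version of the pull side could look like the loop below (the hostname and paths are placeholders, and the repository's own transfer scripts use WinSCP/PowerShell rather than this exact command):

    # Hypothetical pull loop, run on the destination machine: repeatedly
    # copy finished .zst files off the extractor server and delete them
    # there afterwards so its disk does not fill up.
    while true; do
        rsync -av --remove-source-files \
            "root@extractor-server:/opt/*.zst" /mnt/destination/
        sleep 60
    done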

Requirements:

  • Python3
  • Gzip
  • Axel
  • GNU Parallel
  • WinSCP
  • PowerShell
  • Zstd
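
On a Debian/Ubuntu server, the server-side tools can be installed manually with something like the commands below (prerequisites.sh is intended to automate the server setup; WinSCP and PowerShell belong to the transfer side and are not installed here):

    # Hypothetical manual install of the extraction-server requirements
    # on Debian/Ubuntu.
    sudo apt-get update
    sudo apt-get install -y python3 gzip axel parallel zstd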

Pre-Setup Steps:

Before running the scripts, there are some steps required to set up the stack.

A script, "prerequisites.sh", has been provided to automate these steps; run it to set up the stack.

Steps:

  1. Download warc_wat_url_processor.py to /opt/
  2. Upload the urls_to_download.txt file, containing the URLs of all files to be downloaded, to /opt/ (see the example after this list)
  3. Configure commoncrawl_transfer.ps1 with the IP address of the server and the SSH passkey for authentication
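
For reference, urls_to_download.txt is simply one download URL per line. For CommonCrawl WAT files the lines look roughly like the following (the segment and file names are placeholders):

    https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/<segment-id>/wat/<file-name>.warc.wat.gz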

Data Transfer:

I have developed a PowerShell script that uses WinSCP and the FTP protocol. Running multiple instances of the script significantly speeds up the overall transfer rate, so I strongly recommend this method for downloading files from the server.

Notes:

  • The script's current configuration runs 7 download tasks concurrently, with a concurrency of 1 connection per file. This is the best configuration for downloading from the CommonCrawl endpoint without being temporarily blocked with HTTP 403 errors (see the sketch after this list).
  • A text file named "urls_to_download.txt" containing all of the CommonCrawl links must be located in /opt/.
  • I took the lazy route for setting up FTP on the server: the root account is allowed SSH access without a private key.
  • WinSCP Automation must be downloaded and copied over the top of the existing WinSCP installation directory for the script to work.
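
The download configuration described above corresponds, in spirit, to a pattern like the one below (illustrative only; the real logic lives in the repository scripts):

    # Hypothetical illustration of the download settings described above:
    # GNU parallel keeps 7 downloads running at once (-j 7) and axel uses
    # a single connection per file (-n 1) to avoid CommonCrawl HTTP 403s.
    parallel -j 7 axel -q -n 1 {} :::: /opt/urls_to_download.txt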