Overview:

Use the following set of scripts to extract URLs from CommonCrawl WAT files and write them to a compressed text file.

This script stack comprises two processes running concurrently: one on the URL extractor server downloading, processing, and compressing the results, and the other pulling the finished files via rsync to a destination of your choice.

The first process can be run on its own, but the server will quickly fill up if large data sets are being processed (e.g. CommonCrawl WARC/WAT files).
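For context, WAT files are gzipped WARC records whose payloads are JSON metadata, and the outgoing links appear as "url" values inside those payloads. The real extraction is done by the Python processor in this repo; the snippet below is only a rough illustration of the idea using standard shell tools (the file names are placeholders):

```bash
# Rough illustration only - the Python processor does the real work.
# Pull every "url" value out of the JSON payloads of a single WAT file,
# de-duplicate, and write a zstd-compressed text file.
zcat example.warc.wat.gz \
  | grep -oP '"url":"\K[^"]+' \
  | sort -u \
  | zstd > example_urls.txt.zst
```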

Requirements:

  • Python3
  • Gzip
  • Axel
  • Parallel
  • Rsync or WinSCP
  • PowerShell
  • Zstd

Pre-Setup Steps:

Before running the scripts, a few setup steps are required.

A script has been provided to automate them.

Run "prerequisites.sh" to set up the stack.

Steps:

  1. Run url_extractor.py against the directory of txt.gz files you want to process. The script will prompt you for the location of the files, where to store the output, and the concurrency to run at (see the example after this list).
  2. Once the script has finished running, verify the output files.
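As a sketch of steps 1 and 2, assuming output is written with the .zst extension (the prompt wording and file locations are placeholders):

```bash
cd /opt/CommonCrawl_URL_Processor
python3 url_extractor.py
# Example answers to the prompts (placeholders):
#   Location of txt.gz files: /data/wat_downloads
#   Output location:          /data/url_output
#   Concurrency:              8

# Verify the compressed output once the run finishes.
zstdcat /data/url_output/*.txt.zst | head
zstdcat /data/url_output/*.txt.zst | wc -l
```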

Data Transfer:

There are two methods of transferring data from the server running the scripts back to a local machine: rsync or FTP.

From testing I have found that while rsync is robust and reliable, it is quite slow and only moves a single file at a time. Running multiple instances at once does not appear to use any additional bandwidth; the transfers simply compete with each other.
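A minimal rsync pull over SSH looks like the following (the host name and paths are placeholders); note that it moves one file at a time, which is the bottleneck described above:

```bash
# Pull the compressed results down to the local machine (single stream).
rsync -av --progress \
  user@extractor.example.com:/data/url_output/ \
  /local/commoncrawl_output/
```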

I have developed a PowerShell script that uses WinSCP and the FTP protocol. Running multiple instances of the script significantly speeds up the transfer rate, so I strongly recommend this method for downloading files from the server.

Notes:

  • A text file with all of the CommonCrawl links, named "urls_to_download.txt", must be located in /opt/CommonCrawl_URL_Processor (see the sketch after this list for one way to generate it).
  • I took the lazy way of setting up FTP on the server: the root account is allowed SSH access without a private key.
  • The WinSCP Automation package must be downloaded and extracted over the top of the WinSCP installation directory for the script to work.
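One way to generate "urls_to_download.txt" is from the wat.paths.gz listing that CommonCrawl publishes for each crawl; the crawl ID below is a placeholder, so substitute the crawl you intend to process:

```bash
cd /opt/CommonCrawl_URL_Processor
curl -s https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/wat.paths.gz \
  | gunzip \
  | sed 's|^|https://data.commoncrawl.org/|' \
  > urls_to_download.txt
wc -l urls_to_download.txt
```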