From 319d65735c7a41efc8d87b9666c9bb7de7b96663 Mon Sep 17 00:00:00 2001
From: datechnoman
Date: Sun, 31 Mar 2024 11:55:02 +0000
Subject: [PATCH] Update README.md

---
 README.md | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 19d0a34..1119ca8 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 Overview:
 
-Use the following set of scripts to extract urls from CommonCrawl WAT files and output to a compressed txt file.
+Use the following set of scripts to extract URLs from WARC/WAT files and output them to a zstd-compressed txt file.
 
 This script stack compromises of two processes running concurrently (One on the URL Extractor Server downloading, processing and zipping the results and the other stack pulling the files via rsync to a destination of your choice.)
 
@@ -27,8 +27,10 @@ Run "prerequisites.sh" to setup the stack.
 
 Steps:
 
-1. Run url_extractor.py against the directory of txt.gz files you are wanting to process. The script will ask for you to enter the location of the files, where you want to store the output and the concurrency to run the script at.
-2. Once the script has completed running, verify the output of the file.
+1. Download warc_wat_url_processor.py to /opt/
+2. Upload the urls_to_download.txt file containing all of the URLs of the files to be downloaded
+3. Configure commoncrawl_transfer.ps1 to include the IP address of the server and the SSH passkey for authentication
+4. 
 
 Data Transfer:
 
@@ -37,7 +39,8 @@ I have developed a powershell script that uses WinSCP and the FTP protocol. Runn
 
 Notes:
\ No newline at end of file
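
The patched README refers to warc_wat_url_processor.py without showing it. For orientation, here is a minimal Python sketch of that kind of extraction step — stream a gzipped WAT file, regex out URLs, and write them zstd-compressed. The file names, the URL regex, and the third-party `zstandard` package (`pip install zstandard`) are assumptions; the real script may work quite differently.

```python
# Hedged sketch only: file names, regex, and the zstandard dependency
# are illustrative, not taken from warc_wat_url_processor.py itself.
import gzip
import re
import zstandard

# A deliberately simple pattern for http(s) URLs in raw WAT bytes.
URL_RE = re.compile(rb"https?://[^\s\"'<>]+")

def extract_urls(wat_gz_path: str, out_zst_path: str) -> None:
    """Stream a gzipped WAT file and write matched URLs to a .zst text file."""
    cctx = zstandard.ZstdCompressor()
    with gzip.open(wat_gz_path, "rb") as src, open(out_zst_path, "wb") as dst:
        with cctx.stream_writer(dst) as writer:
            for line in src:
                for match in URL_RE.finditer(line):
                    writer.write(match.group(0) + b"\n")

if __name__ == "__main__":
    extract_urls("example.wat.gz", "example_urls.txt.zst")
```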
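The new step 2 implies the extractor server downloads every file listed in urls_to_download.txt, and the overview says the two halves of the stack run concurrently. A stdlib-only sketch of that concurrent download loop follows; the destination directory and worker count are placeholders, not values from the repo.

```python
# Hedged sketch: assumes urls_to_download.txt holds one URL per line;
# dest_dir and max_workers are illustrative choices.
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def download(url: str, dest_dir: str = "/opt/downloads") -> str:
    """Fetch one URL to dest_dir, returning the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
    urllib.request.urlretrieve(url, dest)
    return dest

if __name__ == "__main__":
    with open("urls_to_download.txt") as f:
        urls = [line.strip() for line in f if line.strip()]
    # Run several downloads at once, mirroring the concurrency the README describes.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for path in pool.map(download, urls):
            print("downloaded", path)
```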
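The overview's second process pulls finished files off the server via rsync (the repo's own transfer tooling is the WinSCP-based commoncrawl_transfer.ps1, not shown here). As a rough illustration in Python rather than PowerShell, the pull could be wrapped like this; the remote host, user, and paths are placeholders.

```python
# Hedged sketch of the rsync pull; remote address and paths are invented
# placeholders (203.0.113.10 is a documentation-reserved IP).
import subprocess

def pull_results(remote: str = "user@203.0.113.10:/opt/output/",
                 local_dir: str = "./results/") -> None:
    # -a preserves attributes, -v is verbose, -z compresses in transit;
    # --remove-source-files clears the server once each file has transferred.
    subprocess.run(
        ["rsync", "-avz", "--remove-source-files", remote, local_dir],
        check=True,
    )

if __name__ == "__main__":
    pull_results()
```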