From b24287ef6ff96e80d335f9d154b9fb4d38efc0bb Mon Sep 17 00:00:00 2001
From: datechnoman
Date: Tue, 12 Dec 2023 10:23:37 +0000
Subject: [PATCH] Upload files to "/"

---
 README.md | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..1218ae0
--- /dev/null
+++ b/README.md
@@ -0,0 +1,33 @@
+Overview:
+
+Use the following set of scripts to extract URLs from CommonCrawl WAT files and write them to a compressed txt file.
+
+This script stack comprises two processes running concurrently: one on the URL Extractor Server, which downloads, processes and compresses the results, and another that pulls the files via rsync to a destination of your choice.
+
+The extractor can be run without the second process, but the server's disk will fill quickly when large data sets are processed (e.g. CommonCrawl WARC/WAT files).
+
+Requirements:
+
+
+Pre-Setup Steps:
+Before running the scripts, a few steps are required to set up the stack.
+
+A script is provided to automate these steps.
+
+Run "prerequisites.sh" to set up the stack.
+
+Steps:
+
+1. Run url_extractor.py against the directory of txt.gz files you want to process. The script will prompt you for the location of the files, where to store the output, and the concurrency level to run at.
+2. Once the script has finished, verify the output file.
+
+Notes:
+
+
\ No newline at end of file
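
The extraction step the README describes (read gzip-compressed input, pull out URLs, write a compressed txt file) can be sketched roughly as below. This is an illustration only, not the contents of url_extractor.py, which this patch does not include. The function name `extract_urls`, the regular expression, and the choice to de-duplicate in memory are all assumptions made for the example.

```python
import gzip
import re
from pathlib import Path

# Assumed pattern (not taken from url_extractor.py): an http(s) URL running
# up to the next whitespace, quote, angle bracket or backslash.
URL_RE = re.compile(r"https?://[^\s\"'<>\\]+")


def extract_urls(src: Path, dst: Path) -> int:
    """Collect the unique URLs found in a gzip-compressed text file and
    write them, sorted, one per line, to a gzip-compressed output file.
    Returns the number of unique URLs written."""
    seen = set()
    # Stream the input line by line so the whole file is never held in memory;
    # only the set of unique URLs is.
    with gzip.open(src, "rt", encoding="utf-8", errors="replace") as fin:
        for line in fin:
            seen.update(URL_RE.findall(line))
    with gzip.open(dst, "wt", encoding="utf-8") as fout:
        for url in sorted(seen):
            fout.write(url + "\n")
    return len(seen)
```

With hypothetical file names, a call would look like `extract_urls(Path("segment.warc.wat.gz"), Path("segment_urls.txt.gz"))`. Holding the unique URLs in a set keeps the sketch short; for very large inputs, writing matches out as they are found and de-duplicating later may be the safer choice.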