diff --git a/README.md b/README.md
new file mode 100644
index 0000000..1218ae0
--- /dev/null
+++ b/README.md
@@ -0,0 +1,33 @@
+Overview:
+
+Use the following set of scripts to extract URLs from CommonCrawl WAT files and write them to a compressed text file.
+
+This script stack comprises two processes running concurrently: one on the URL Extractor Server, which downloads, processes, and compresses the results, and another that pulls the resulting files via rsync to a destination of your choice.
+
+The scripts can be run without the second process, but the server's disk will fill quickly when large data sets are processed (e.g. CommonCrawl WARC/WAT files).
+
+Requirements:
+