diff --git a/README.md b/README.md
index 3277ad0..92b9bdc 100644
--- a/README.md
+++ b/README.md
@@ -11,10 +11,14 @@ Depending on the types of URL's that are being processed you will either need to
 
 Steps:
 
-1. Git Clone the Repository
+1. Git Clone the Repository.
 2. Run blogger_url_cleaner.py against the directory of txt.gz files you are wanting to process. The script will ask for you to enter the location of the files, where you want to store the output and the concurrency to run the script at.
-3. 
+3. Once the script has completed running, verify the output and if there are many blogger image links run blogger_remove_img_lines.py.
+4. Run blogger_remove_img_lines.py against the directory containing the newly created output from step 2. The script will ask for you to enter the location of the files, where you want to store the output and the concurrency to run the script at.
 
 Notes:
 
-The script is hardset to stream the test files from the location and process it line by line migitating the need for large files to be loaded into RAM.
\ No newline at end of file
+
\ No newline at end of file
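
The note removed in this diff describes streaming each file line by line so that large files never need to be loaded into RAM. A minimal sketch of that idea follows, assuming gzip-compressed text input with one URL per line. The names `stream_filter_gz`, `is_image_link`, and `IMAGE_SUFFIXES` are hypothetical, and the suffix check is only an illustrative stand-in: the actual rule used by blogger_remove_img_lines.py is not shown in the README.

```python
import gzip
from pathlib import Path

# Hypothetical suffix list; the real script's image-detection rule is not shown.
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif")


def is_image_link(line: str) -> bool:
    """Illustrative predicate: a URL ending in an image suffix counts as an image link."""
    return line.strip().lower().endswith(IMAGE_SUFFIXES)


def stream_filter_gz(src: Path, dst: Path) -> int:
    """Copy src to dst one line at a time, dropping image links; return lines kept."""
    kept = 0
    with gzip.open(src, "rt", encoding="utf-8", errors="replace") as fin, \
         gzip.open(dst, "wt", encoding="utf-8") as fout:
        # Iterating the file handle decompresses and yields one line at a time,
        # so memory use does not grow with the size of the input file.
        for line in fin:
            if not is_image_link(line):
                fout.write(line)
                kept += 1
    return kept
```

Calling `stream_filter_gz(Path("in.txt.gz"), Path("out.txt.gz"))` would process a single file; a directory of files could be handled by calling it once per file, for example from a worker pool sized by the requested concurrency.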