Upload files to "/"

datechnoman 2023-12-12 09:27:23 +00:00
parent 718cee5e8a
commit 72c85f86c6

@ -3,3 +3,18 @@
Use the following scripts to extract URLs from .txt.gz files and write them to a txt file.
Depending on the type of URLs being processed, you will either need only "blogger_url_cleaner.py" (which plainly extracts the URLs from a file) or will also need "blogger_remove_img_lines.py", which reads the txt file and outputs all lines that do not contain "jpg|png|gif|jpeg".
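
As a rough illustration of the image-line filtering described above, here is a minimal sketch. It is not the code in "blogger_remove_img_lines.py"; the function name `remove_image_lines` is hypothetical, and matching case-insensitively is an assumption.

```python
import re

# Assumption: the filter matches the extensions named in the README anywhere
# in the line, ignoring case. The real script may behave differently.
IMAGE_PATTERN = re.compile(r"jpg|png|gif|jpeg", re.IGNORECASE)


def remove_image_lines(lines):
    """Yield only the lines that do not mention an image extension."""
    for line in lines:
        if not IMAGE_PATTERN.search(line):
            yield line


sample = [
    "https://example.blogspot.com/2023/12/post.html",
    "https://example.blogspot.com/images/photo.JPG",
    "https://example.blogspot.com/anim.gif",
]
kept = list(remove_image_lines(sample))
print(kept)
```

Note that a plain substring match like this also drops any URL that merely contains "png" or "gif" somewhere in its path, which may or may not be what you want.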
<b>Requirements:</b>
<ul>
<li>Python3</li>
</ul>
<b>Steps:</b>
1. Clone the repository with Git.
2. Run blogger_url_cleaner.py against the directory of .txt.gz files you want to process. The script will prompt you for the location of the files, where to store the output, and the concurrency to run at.
3.
<b>Notes:</b>
The script is hard-coded to stream the text files from the given location and process them line by line, mitigating the need to load large files into RAM.
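
The line-by-line streaming mentioned in the note could look roughly like the sketch below. This is an assumption about the approach, not the contents of blogger_url_cleaner.py: the names `iter_urls` and `extract_to_file` and the URL regular expression are hypothetical.

```python
import gzip
import re

# Assumption: a simple pattern for http/https URLs; the real script may use another.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def iter_urls(gz_path):
    """Stream a .txt.gz file and yield URLs one line at a time."""
    # Text mode decompresses lazily, so only the current line is held in memory.
    with gzip.open(gz_path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield from URL_PATTERN.findall(line)


def extract_to_file(gz_path, out_path):
    """Write every URL found in gz_path to out_path, one per line."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for url in iter_urls(gz_path):
            out.write(url + "\n")
            count += 1
    return count
```

Because the file is iterated rather than read in one call, memory use stays roughly constant regardless of the archive size.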