<b>Overview:</b>

Use the following scripts to extract URLs from .txt.gz files and output them to a .txt file.
<b>Requirements:</b>

<ul>
<li>Python3</li>
</ul>
<b>Steps:</b>

1. Git clone the repository.
2. Run url_extractor.py against the directory of .txt.gz files you want to process. The script will prompt you for the location of the files, where to store the output, the keyword/URL you want to extract, and the concurrency to run the script at.
3. Once the script has finished running, verify the output file.
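The extraction loop described in step 2 can be sketched as follows. This is a hypothetical illustration, not the repository's actual code: the function name `extract_urls`, the URL regex, and the keyword filter are assumptions based on the behaviour described above.

```python
# Hypothetical sketch of the core of url_extractor.py: stream a .txt.gz
# file line by line (so large files never need to fit in RAM) and write
# matching URLs to a plain-text output file.
import gzip
import re

# Simple URL pattern; the real script's pattern may differ.
URL_RE = re.compile(r'https?://[^\s"\'<>]+')

def extract_urls(in_path, out_path, keyword=""):
    """Stream in_path (.txt.gz); write URLs containing keyword to out_path."""
    count = 0
    with gzip.open(in_path, "rt", encoding="utf-8", errors="replace") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:  # one line at a time, never the whole file
            for url in URL_RE.findall(line):
                if keyword in url:
                    dst.write(url + "\n")
                    count += 1
    return count
```

An empty `keyword` exports every URL found, matching the simpler prompt behaviour before the keyword option was added.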
<b>Notes:</b>

<ul>
<li>The script always streams the .txt files from their location and processes them line by line, mitigating the need to load large files into RAM.</li>
<li>Running the script over the network is fine and performance does not appear to be impacted. When running against CommonCrawl WAT files, a 1Gbit link can be saturated with 12 concurrent processes on an i7, with CPU capacity still to spare.</li>
</ul>
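The concurrency note above can be illustrated with a minimal sketch of fanning files out to a pool of worker processes, one streaming file per worker. Nothing here is the repository's actual code: `process_file` and `run` are hypothetical names, and the worker body is a stand-in for the real URL extraction.

```python
# Minimal sketch (assumed, not the repo's code) of running N concurrent
# worker processes, each streaming one .txt.gz file at a time so memory
# stays flat regardless of file size.
import glob
import gzip
import os
from multiprocessing import Pool

def process_file(path):
    # Stand-in worker: the real script would extract URLs here;
    # this version just counts lines to keep the sketch self-contained.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        return sum(1 for _ in f)

def run(directory, concurrency=12):
    """Map every *.txt.gz in directory across `concurrency` processes."""
    files = sorted(glob.glob(os.path.join(directory, "*.txt.gz")))
    with Pool(processes=concurrency) as pool:
        return dict(zip(files, pool.map(process_file, files)))
```

Because each worker only holds one line in memory at a time, raising the process count (e.g. the 12 processes mentioned above) trades CPU and network bandwidth, not RAM.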