
How to Scrape Email Addresses from a Website using Python?
If you want to scrape email addresses from a website, follow this easy process.
Step 1: Import modules
We import six modules for this project:

- re for regular expression matching operations
- requests for sending HTTP requests
- urlsplit for breaking URLs down into component parts
- deque, a list-like container with fast appends and pops on either end
- BeautifulSoup for pulling data out of HTML files of websites
- pandas for formatting emails into a DataFrame for further manipulation
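The import block, as it appears in the complete code at the end of this article:

import re
import requests
from urllib.parse import urlsplit
from collections import deque
from bs4 import BeautifulSoup
import pandas as pd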
Step 2: Initialize variables
Then, we initialize a deque for saving unscraped URLs, a set for saving scraped URLs, and a set for saving emails scraped successfully from the website. Elements in a set are unique; duplicate elements are not allowed.
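From the complete code, the corresponding initialization:

original_url = input("Enter the website url: ")

unscraped = deque([original_url])  # URLs still to be crawled
scraped = set()                    # URLs already crawled
emails = set()                     # emails found so far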
Step 3: Start scraping
1. First, move a url from unscraped to scraped.
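In the complete code, this happens at the top of the crawl loop:

while len(unscraped):
    # move the next URL from the unscraped queue to the scraped set
    url = unscraped.popleft()
    scraped.add(url)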
2. Then we use urlsplit to extract different parts of the url. urlsplit() returns a 5-tuple: (addressing scheme, network location, path, query, fragment identifier).
A sample input and output for urlsplit(), using a hypothetical URL:
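from urllib.parse import urlsplit

# hypothetical URL, for illustration only
parts = urlsplit("https://example.com/blog/post?page=2")
print(parts)
# SplitResult(scheme='https', netloc='example.com', path='/blog/post', query='page=2', fragment='')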
This way, we are able to get the base and path parts of the website URL.
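In the complete code, the extraction looks like this:

parts = urlsplit(url)
base_url = "{0.scheme}://{0.netloc}".format(parts)
if '/' in parts.path:
    # keep everything up to and including the last slash
    path = url[:url.rfind('/')+1]
else:
    path = url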
3. Send an HTTP GET request to the website.
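The request is wrapped in a try/except so that broken URLs and connection errors are skipped:

print("Crawling URL %s" % url)
try:
    response = requests.get(url)
except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
    continue  # ignore pages with errors and move on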
4. Extract all email addresses from the response using a regular expression, and add them to the emails set.
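In the complete code, a case-insensitive pattern matches addresses ending in .com:

new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.com", response.text, re.I))
emails.update(new_emails)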
If you are not familiar with Python regular expressions, check Python RegEx for more information.
5. Find all linked URLs on the website.

To do so, we first need to create a Beautiful Soup object to parse the HTML document. Then we can find all the linked URLs in the document by looking for the <a href=""> tag, which indicates a hyperlink.
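From the complete code:

soup = BeautifulSoup(response.text, 'lxml')
for anchor in soup.find_all("a"):
    if "href" in anchor.attrs:
        link = anchor.attrs["href"]
    else:
        link = ''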
Add the new url to the unscraped queue if it is not yet in unscraped or in scraped. We also need to exclude links like http://www.medium.com/file.gz that cannot be scraped.
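The complete code also resolves relative links against the base URL or the current path before queueing:

# resolve relative links against the base URL or current path
if link.startswith('/'):
    link = base_url + link
elif not link.startswith('http'):
    link = path + link

# queue new, scrapable links only
if not link.endswith(".gz"):
    if not link in unscraped and not link in scraped:
        unscraped.append(link)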
Step 4: Export emails to a CSV file
After successfully scraping emails from the website, we can export the emails to a CSV file.
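Using pandas:

df = pd.DataFrame(emails, columns=["Email"])
df.to_csv('email.csv', index=False)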
If you are using Google Colaboratory, you can download the file to the local machine by:
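from google.colab import files

files.download("email.csv")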
Sample output CSV file:
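A hypothetical example of the format (one Email column, one address per row):

Email
info@example.com
support@example.com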

Complete Code
import re
import requests
from urllib.parse import urlsplit
from collections import deque
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import files

original_url = input("Enter the website url: ")

unscraped = deque([original_url])  # URLs still to be crawled
scraped = set()                    # URLs already crawled
emails = set()                     # emails found so far

while len(unscraped):
    # move the next URL from the unscraped queue to the scraped set
    url = unscraped.popleft()
    scraped.add(url)

    # extract the base and path parts of the URL
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    if '/' in parts.path:
        path = url[:url.rfind('/')+1]
    else:
        path = url

    print("Crawling URL %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        continue  # ignore pages with errors and move on

    # extract all email addresses from the response
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.com", response.text, re.I))
    emails.update(new_emails)

    # find all linked URLs in the document
    soup = BeautifulSoup(response.text, 'lxml')
    for anchor in soup.find_all("a"):
        if "href" in anchor.attrs:
            link = anchor.attrs["href"]
        else:
            link = ''

        # resolve relative links against the base URL or current path
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link

        # queue new, scrapable links only
        if not link.endswith(".gz"):
            if not link in unscraped and not link in scraped:
                unscraped.append(link)

df = pd.DataFrame(emails, columns=["Email"])
df.to_csv('email.csv', index=False)
files.download("email.csv")