How to Scrape Email Addresses from a Website using Python?

If you want to scrape email addresses from a website, you just need to follow this easy process.

Step 1: Import modules

We import six modules for this project (the import statements are shown after the list):

  1. re for regular expression matching operations
  2. requests for sending HTTP requests
  3. urlsplit for breaking URLs down into component parts
  4. deque is a list-like container with fast appends and pops on either end
  5. BeautifulSoup for pulling data out of HTML files of websites
  6. pandas for formatting emails into a DataFrame for further manipulation
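
The corresponding import statements, as they appear in the complete code at the end of this post:

import re
import requests
from urllib.parse import urlsplit
from collections import deque
from bs4 import BeautifulSoup
import pandas as pd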

Step 2: Initialize variables

Then, we initialize a deque for saving unscraped URLs, a set for scraped URLs, and a set for saving emails scraped successfully from the website.

Elements in a set are unique; duplicate elements are not allowed.
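
Starting from a URL entered by the user, the variables are set up like this in the complete code below:

original_url = input("Enter the website url: ")

unscraped = deque([original_url])  # URLs still to be crawled
scraped = set()                    # URLs already crawled
emails = set()                     # email addresses found so far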

Step 3: Start scraping

  1. First, move a URL from unscraped to scraped.

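In code, this step is simply:

url = unscraped.popleft()  # take the next URL from the queue
scraped.add(url)           # mark it as visited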

2. Then we use urlsplit to extract different parts of the URL.

urlsplit() returns a 5-tuple: (addressing scheme, network location, path, query, fragment identifier).

Sample input & output for urlsplit():
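
For illustration (the URL below is a made-up example, not taken from the original post):

from urllib.parse import urlsplit

parts = urlsplit("https://www.example.com/team/about.html?lang=en#contact")
print(parts)
# SplitResult(scheme='https', netloc='www.example.com',
#             path='/team/about.html', query='lang=en', fragment='contact')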

In this way, we are able to get the base and path parts of the website URL.
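
The complete code derives them as follows:

parts = urlsplit(url)
base_url = "{0.scheme}://{0.netloc}".format(parts)
if '/' in parts.path:
    path = url[:url.rfind('/') + 1]  # keep everything up to and including the last slash
else:
    path = url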

3. Send an HTTP GET request to the website.
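
The request is wrapped in a try/except so that invalid or unreachable URLs are simply skipped:

print("Crawling URL %s" % url)
try:
    response = requests.get(url)
except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
    continue  # skip URLs with a bad scheme or that cannot be reached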

4. Extract all email addresses from the response using a regular expression, and add them to the email set.
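
The pattern used in the complete code matches addresses ending in .com, case-insensitively:

new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.com", response.text, re.I))
emails.update(new_emails)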

If you are not familiar with Python regular expressions, check Python RegEx for more information.

5. Find all linked URLs on the website.

To do so, we first need to create a Beautiful Soup object to parse the HTML document.
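
Here the lxml parser is used; it needs to be installed separately, and Python's built-in html.parser would also work:

soup = BeautifulSoup(response.text, 'lxml')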

Then we can find all the linked URLs in the document by looking for <a href=""> tags, which indicate hyperlinks.
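
Inside the crawling loop, relative links are turned into absolute ones using the base URL and path extracted earlier:

for anchor in soup.find_all("a"):
    if "href" in anchor.attrs:
        link = anchor.attrs["href"]
    else:
        link = ''

    if link.startswith('/'):
        link = base_url + link   # link relative to the site root
    elif not link.startswith('http'):
        link = path + link       # link relative to the current page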

Add the new URL to the unscraped queue if it is not yet in unscraped or scraped.

We also need to exclude links such as http://www.medium.com/file.gz that cannot be scraped.
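
Continuing the loop body from the previous snippet:

if not link.endswith(".gz"):
    if link not in unscraped and link not in scraped:
        unscraped.append(link)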

Step 4: Export emails to a CSV file

After successfully scraping emails from the website, we can export the emails to a CSV file.
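
With pandas this takes two lines:

df = pd.DataFrame(emails, columns=["Email"])
df.to_csv('email.csv', index=False)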

If you are using Google Colaboratory, you can download the file to your local machine as follows:
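
The files helper from the google.colab package handles the download:

from google.colab import files

files.download("email.csv")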

Sample output CSV file:


Complete Code


import re
import requests
from urllib.parse import urlsplit
from collections import deque
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import files  # only needed when running in Google Colaboratory

original_url = input("Enter the website url: ")

unscraped = deque([original_url])  # URLs still to be crawled
scraped = set()                    # URLs already crawled
emails = set()                     # email addresses found so far

while len(unscraped):
    # move a URL from the unscraped queue to the scraped set
    url = unscraped.popleft()
    scraped.add(url)

    # extract the base URL and the path of the current page
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    if '/' in parts.path:
        path = url[:url.rfind('/') + 1]
    else:
        path = url

    print("Crawling URL %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # skip URLs with a bad scheme or that cannot be reached
        continue

    # extract email addresses from the page and add them to the set
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.com", response.text, re.I))
    emails.update(new_emails)

    # parse the page and collect linked URLs
    soup = BeautifulSoup(response.text, 'lxml')
    for anchor in soup.find_all("a"):
        if "href" in anchor.attrs:
            link = anchor.attrs["href"]
        else:
            link = ''

        # resolve relative links against the base URL or the current path
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link

        # queue the link if it is scrapable and has not been seen yet
        if not link.endswith(".gz"):
            if link not in unscraped and link not in scraped:
                unscraped.append(link)

# export the collected emails to a CSV file
df = pd.DataFrame(emails, columns=["Email"])
df.to_csv('email.csv', index=False)
files.download("email.csv")

