
Web scraping with Python


Remarks:

Useful Python packages for web scraping (alphabetical order)

Making requests and collecting data

requests

A simple, but powerful package for making HTTP requests.

requests-cache

Caching for requests; caching data is very useful. In development, it means you can avoid hitting a site unnecessarily. While running a real collection, it means that if your scraper crashes for some reason (maybe you didn't handle some unusual content on the site, or maybe the site went down) you can repeat the collection very quickly from where you left off.
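
Enabling it is essentially a one-liner. A minimal sketch (the cache name 'scraping_cache' is just an example):

import requests
import requests_cache

# Cache every request made through the requests library in a local SQLite database.
requests_cache.install_cache('scraping_cache')

r = requests.get('https://httpbin.org/get')   # first call hits the network
r = requests.get('https://httpbin.org/get')   # second call is answered from the cache
print(r.from_cache)                           # attribute added by requests-cache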

scrapy

Useful for building web crawlers, where you need something more powerful than using requests and iterating through pages.

selenium

Python bindings for Selenium WebDriver, for browser automation. Using requests to make HTTP requests directly is often simpler for retrieving webpages. However, this remains a useful tool when it is not possible to replicate the desired behaviour of a site using requests alone, particularly when JavaScript is required to render elements on a page.

HTML parsing

BeautifulSoup

Query HTML and XML documents, using a number of different parsers (Python's built-in HTML Parser, html5lib, lxml or lxml.html).
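
For instance, the parser is chosen via the second argument to the BeautifulSoup constructor (a minimal sketch using an inline HTML string):

from bs4 import BeautifulSoup

html = "<html><head><title>Example</title></head><body><p>Hello</p></body></html>"

# 'html.parser' is built into Python; 'lxml' and 'html5lib' must be installed separately.
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.text)   # Example
print(soup.p.text)       # Hello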

lxml

Processes HTML and XML. Can be used to query and select content from HTML documents via CSS selectors and XPath.
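
A small sketch of both query styles (CSS selectors require the separate cssselect package; the URL is only an example):

import requests
import lxml.html

r = requests.get('https://httpbin.org')
root = lxml.html.fromstring(r.text)

# CSS selectors (needs the cssselect package installed)
links_css = root.cssselect('a')
# The equivalent XPath query
links_xpath = root.xpath('//a')

print([link.get('href') for link in links_css])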

Basic example of using requests and lxml to scrape some data

# For Python 2 compatibility.
from __future__ import print_function

import lxml.html
import requests


def main():
    r = requests.get("https://httpbin.org")
    html_source = r.text
    root_element = lxml.html.fromstring(html_source)
    # Note root_element.xpath() gives a *list* of results.
    # XPath specifies a path to the element we want.
    page_title = root_element.xpath('/html/head/title/text()')[0]
    print(page_title)

if __name__ == '__main__':
    main()

Maintaining web-scraping session with requests

It is a good idea to maintain a web-scraping session to persist the cookies and other parameters. Additionally, it can result in a performance improvement because requests.Session reuses the underlying TCP connection to a host:

import requests

with requests.Session() as session:
    # all requests through session now have User-Agent header set
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}

    # set cookies
    session.get('http://httpbin.org/cookies/set?key=value')

    # get cookies
    response = session.get('http://httpbin.org/cookies')
    print(response.text)

Scraping using the Scrapy framework

First you have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:

scrapy startproject projectName

To scrape we need a spider. Spiders define how a certain site will be scraped. Here’s the code for a spider that follows the links to the top voted questions on StackOverflow and scrapes some data from each page:

import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'  # each spider has a unique name
    start_urls = ['http://stackoverflow.com/questions?sort=votes']  # the parsing starts from a specific set of urls

    def parse(self, response):  # for each request this generator yields, its response is sent to parse_question
        for href in response.css('.question-summary h3 a::attr(href)'):  # do some scraping stuff using css selectors to find question urls 
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response): 
        yield {
            'title': response.css('h1 a::text').extract_first(),
            'votes': response.css('.question .vote-count-post::text').extract_first(),
            'body': response.css('.question .post-text').extract_first(),
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }

Save your spider classes in the projectName/spiders directory, in this case projectName/spiders/stackoverflow_spider.py.

Now you can use your spider. For example, try running (in the project's directory):

scrapy crawl stackoverflow
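
If you also want to store the scraped items, you can pass an output file to the crawl command and let Scrapy's feed export write them out (the file name here is just an example):

scrapy crawl stackoverflow -o top-questions.json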

Modify Scrapy user agent

Sometimes the default Scrapy user agent ("Scrapy/VERSION (+http://scrapy.org)") is blocked by the host. To change the default user agent, open settings.py, then uncomment and edit the following line to whatever you want.

#USER_AGENT = 'projectName (+http://www.yourdomain.com)'

For example

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'

Scraping using BeautifulSoup4

from bs4 import BeautifulSoup
import requests

# Use the requests module to obtain a page
res = requests.get('https://www.codechef.com/problems/easy')

# Create a BeautifulSoup object
page = BeautifulSoup(res.text, 'lxml')   # the text field contains the source of the page

# Now use a CSS selector in order to get the table containing the list of problems
datatable_tags = page.select('table.dataTable')  # The problems are in the <table> tag,
                                                 # with class "dataTable"
# We extract the first tag from the list, since that's what we desire
datatable = datatable_tags[0]
# Now since we want problem names, they are contained in <b> tags, which are
# directly nested under <a> tags
prob_tags = datatable.select('a > b')
prob_names = [tag.getText().strip() for tag in prob_tags]

print(prob_names)

Scraping using Selenium WebDriver

Some websites don’t like to be scraped. In these cases you may need to simulate a real user working with a browser. Selenium launches and controls a web browser.

from selenium import webdriver

browser = webdriver.Firefox()  # launch firefox browser

browser.get('http://stackoverflow.com/questions?sort=votes')  # load url

title = browser.find_element_by_css_selector('h1').text  # page title (first h1 element)

questions = browser.find_elements_by_css_selector('.question-summary')  # question list

for question in questions:  # iterate over questions
    question_title = question.find_element_by_css_selector('.summary h3 a').text
    question_excerpt = question.find_element_by_css_selector('.summary .excerpt').text
    question_vote = question.find_element_by_css_selector('.stats .vote .votes .vote-count-post').text
    
    print "%s\n%s\n%s votes\n-----------\n" % (question_title, question_excerpt, question_vote) 

Selenium can do much more. It can modify the browser’s cookies, fill in forms, simulate mouse clicks, take screenshots of web pages, and run custom JavaScript.
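
A brief sketch of a few of those capabilities (the file name and the JavaScript snippet are only examples):

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://stackoverflow.com/questions?sort=votes')

# Take a screenshot of the current page
browser.save_screenshot('questions.png')

# Run custom JavaScript in the page and get the result back in Python
page_height = browser.execute_script('return document.body.scrollHeight')
print(page_height)

# Inspect the cookies the site has set
print(browser.get_cookies())

browser.quit()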

Simple web content download with urllib.request

The standard library module urllib.request can be used to download web content:

from urllib.request import urlopen

response = urlopen('http://stackoverflow.com/questions?sort=votes')    
data = response.read()

# The received bytes should usually be decoded according to the response's character set
encoding = response.info().get_content_charset()
html = data.decode(encoding)

A similar module, urllib2, is available in Python 2.
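
A rough Python 2 equivalent as a sketch (here the character set is simply assumed to be UTF-8 instead of being read from the response headers):

# Python 2
from urllib2 import urlopen

response = urlopen('http://stackoverflow.com/questions?sort=votes')
data = response.read()
html = data.decode('utf-8')  # assumed encoding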

Scraping with curl

Imports:

from subprocess import Popen, PIPE
from lxml import etree
from io import StringIO

Downloading:

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
url = 'http://stackoverflow.com'
get = Popen(['curl', '-s', '-A', user_agent, url], stdout=PIPE)
result = get.stdout.read().decode('utf8')

-s: silent download

-A: user agent flag

Parsing:

tree = etree.parse(StringIO(result), etree.HTMLParser())
divs = tree.xpath('//div')
