kennethreitz/requests-html
Requests-HTML: HTML Parsing for Humans™
This library intends to make parsing HTML (e.g. scraping the web) as
simple and intuitive as possible.
When using this library you automatically get:
- CSS Selectors (a.k.a. jQuery-style, thanks to PyQuery).
- XPath Selectors, for the faint at heart.
- Mocked user-agent (like a real web browser).
- Automatic following of redirects.
- Connection-pooling and cookie persistence.
- The Requests experience you know and love, with magical parsing abilities.
Other nice features include:
- Markdown export of pages and parts.
Usage
Make a GET request to 'python.org', using Requests:
>>> from requests_html import session
>>> r = session.get('https://python.org/')
Grab a list of all links on the page, as-is (anchors excluded):
>>> r.html.links
{'/users/membership/', '/about/gettingstarted/', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', '/about/success/', 'http://flask.pocoo.org/', 'http://www.djangoproject.com/', '/blogs/', ... '/psf-landing/', 'https://wiki.python.org/moin/PythonBooks'}
Grab a list of all links on the page, in absolute form (anchors excluded):
>>> r.html.absolute_links
{'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', 'https://www.python.org/downloads/mac-osx/', 'http://flask.pocoo.org/', 'https://www.python.org/docs.python.org/3/tutorial/', 'http://www.djangoproject.com/', 'https://wiki.python.org/moin/BeginnersGuide', 'https://www.python.org/about/success/', 'http://twitter.com/ThePSF', 'https://www.python.org/events/python-user-group/634/', ..., 'https://wiki.python.org/moin/PythonBooks'}
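To make the difference between the as-is and absolute forms concrete, here is a minimal standard-library sketch that collects links from an HTML string and resolves them against a base URL. It is not how Requests-HTML implements `links` or `absolute_links`; the `LinkCollector` class, the `collect_links` helper, and the sample HTML are all hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers href values from <a> tags, skipping anchors."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip missing hrefs and pure in-page anchors such as "#top".
        if href and not href.startswith("#"):
            self.links.add(href)


def collect_links(html, base_url):
    """Return (as-is links, absolute links) for an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    # urljoin leaves already-absolute URLs untouched and resolves relative ones.
    absolute = {urljoin(base_url, link) for link in parser.links}
    return parser.links, absolute


# Made-up sample markup, not taken from python.org.
sample = (
    '<a href="/about/">About</a> '
    '<a href="#top">Top</a> '
    '<a href="http://flask.pocoo.org/">Flask</a>'
)
links, absolute_links = collect_links(sample, "https://www.python.org/")
print(sorted(links))
print(sorted(absolute_links))
```

Both results are sets, which matches the set-style output shown above: order is not meaningful and duplicates collapse.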
Select an element with a CSS Selector:
>>> about = r.html.find('#about', first=True)
Grab an element's text contents:
>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure
Introspect an Element's attributes:
>>> about.attrs
{'id': 'about', 'class': 'tier-1 element-1 ', 'aria-haspopup': 'true'}
Select Elements within Elements:
>>> about.find('a')
[, , , , , ]
Render an Element as Markdown:
>>> print(about.markdown)
* [About](/about/)
* [Applications](/about/apps/)
* [Quotes](/about/quotes/)
* [Getting Started](/about/gettingstarted/)
* [Help](/about/help/)
* [Python Brochure](http://brochure.getpython.info/)
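For readers curious what a link list looks like on its way to Markdown, the following is a small standard-library sketch that turns each `<a>` tag into a Markdown bullet. It only approximates the shape of the output above and makes no claim about how Requests-HTML produces `markdown`; the `LinkListToMarkdown` class, the `links_to_markdown` helper, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkListToMarkdown(HTMLParser):
    """Hypothetical converter: emits one Markdown bullet per <a> tag."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a> tag.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._text).strip()
            self.lines.append(f"* [{label}]({self._href})")
            self._href = None


def links_to_markdown(html):
    converter = LinkListToMarkdown()
    converter.feed(html)
    converter.close()
    return "\n".join(converter.lines)


# Made-up sample markup for illustration.
sample = (
    '<ul><li><a href="/about/">About</a></li>'
    '<li><a href="/about/help/">Help</a></li></ul>'
)
markdown = links_to_markdown(sample)
print(markdown)
```

A real converter also has to handle headings, emphasis, nested lists, and escaping, which this sketch ignores.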
Search for text on the page:
>>> r.html.search('Python is a {} language')[0]
programming
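The `{}` in the search template acts as a placeholder for whatever text sits between the surrounding literals. The sketch below imitates that behaviour with the standard-library `re` module. It is only an approximation of the idea, not the library's implementation, and the `search_template` function and the sample sentence are hypothetical.

```python
import re


def search_template(template, text):
    """Return the text captured by each {} placeholder, or None if no match.

    Simplification: a placeholder at the very end of the template captures
    as little as possible, because the capture group is non-greedy.
    """
    # Escape the literal parts, then join them with a non-greedy capture group
    # standing in for every {} placeholder.
    parts = [re.escape(part) for part in template.split("{}")]
    pattern = "(.+?)".join(parts)
    match = re.search(pattern, text)
    return list(match.groups()) if match else None


# Made-up page text for illustration.
page_text = "Python is a programming language that lets you work quickly."
result = search_template("Python is a {} language", page_text)
print(result[0])
```

Indexing the result with `[0]` mirrors the example above: it picks the text matched by the first placeholder.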
More complex CSS Selector example (copied from Chrome dev tools):
>>> r = session.get('https://github.com/')
>>> sel = 'body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p'
>>> print(r.html.find(sel, first=True).text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.
XPath is also supported:
>>> r.html.xpath('a')
[]
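For a feel of what an XPath query such as `'a'` or `'.//a'` selects, here is a small offline sketch using the standard library's `xml.etree.ElementTree`. ElementTree supports only a limited subset of XPath and requires well-formed XML, so it is a stand-in for illustration rather than what Requests-HTML uses internally. The snippet and its URLs are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet; real-world HTML is often not valid XML.
snippet = """
<div>
  <p>Please use a <a class="btn" href="https://example.com/help">supported browser</a>.</p>
  <a href="/about/">About</a>
</div>
"""

root = ET.fromstring(snippet)

# ".//a" selects every <a> descendant of the root, at any depth.
anchors = root.findall(".//a")
hrefs = [a.get("href") for a in anchors]

# A predicate narrows the match to anchors with a specific attribute value.
buttons = root.findall(".//a[@class='btn']")
button_texts = [b.text for b in buttons]

print(hrefs)
print(button_texts)
```

XPath is handy when a match depends on attributes or document structure that would be awkward to express as a CSS selector.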
Installation
$ pipenv install requests-html
✨🍰✨