github.com-whymirror-hpricot_-_2012-03-16_19-47-16
- Publication date
- 2012-03-16
A swift, liberal HTML parser with a fantastic library
Hpricot, Read Any HTML
Hpricot is a fast, flexible HTML parser written in C. It's designed to be very accommodating (like Tanaka Akira's HTree) and to have a very helpful library (like some JavaScript libs -- jQuery, Prototype -- give you). The XPath and CSS parser, in fact, is based on John Resig's jQuery.
Also, Hpricot can be handy for reading broken XML files, since many of the same techniques can be used. If a quote is missing, Hpricot tries to figure it out. If tags overlap, Hpricot works on sorting them out. You know, that sort of thing.
Please read this entire document before making assumptions about how this software works.
An Overview
Let's clear up what Hpricot is.
- Hpricot is a standalone library. It requires no other libraries. Just Ruby!
- While priding itself on speed, Hpricot works hard to sort out bad HTML and pays a small penalty in order to get that right. Getting it right is slightly more important to me than speed.
- If you can see it in Firefox, then Hpricot should parse it. That's how it should be! Let me know the minute it's otherwise.
- Primarily, Hpricot is used for reading HTML and tries to sort out troubled HTML by having some idea of what good HTML is. Some people still like to use Hpricot for XML reading, but remember to use the Hpricot::XML() method for that!
The Hpricot Kingdom
First, here are all the links you need to know:
- http://wiki.github.com/hpricot/hpricot is the Hpricot wiki and http://github.com/hpricot/hpricot/issues is the bug tracker. Go there for news and recipes and patches. It's the center of activity.
- http://github.com/hpricot/hpricot is the main Git repository for Hpricot. You can get the latest code there.
- See COPYING for the terms of this software. (Spoiler: it's absolutely free.)
If you have any trouble, don't hesitate to contact the author. As always, I'm not going to say "Use at your own risk" because I don't want this library to be risky. If you trip on something, I'll share the liability by repairing things as quickly as I can. Your responsibility is to report the inadequacies.
Installing Hpricot
You may get the latest stable version from RubyForge. Win32 binaries, Java binaries (for JRuby), and source gems are available.
$ gem install hpricot
An Hpricot Showcase
We're going to run through a big pile of examples to get you jump-started. Many of these examples are also found at http://wiki.github.com/hpricot/hpricot/hpricot-basics, in case you want to add some of your own.
Loading Hpricot Itself
You have probably got the gem, right? To load Hpricot:
require 'rubygems'
require 'hpricot'
If you've installed the plain source distribution, go ahead and just:
require 'hpricot'
Load an HTML Page
The Hpricot() method takes a string or any IO object and loads the contents into a document object.
doc = Hpricot("<p>A simple <b>test</b> string.</p>")
To load from a file, just get the stream open:
doc = open("index.html") { |f| Hpricot(f) }
To load from a web URL, use open-uri, which comes with Ruby:
require 'open-uri'
doc = open("http://qwantz.com/") { |f| Hpricot(f) }
Hpricot uses an internal buffer to parse the file, so the IO will stream properly and large documents won't be loaded into memory all at once. However, the parsed document object will be present in memory, in its entirety.
Search for Elements
Use Doc.search:
doc.search("//p[@class='posted']")
#=> an Hpricot::Elements collection holding the matching p elements
Doc.search can take an XPath or CSS expression. In the above example, all paragraph (p) elements are grabbed which have a class attribute of "posted".
A shortcut is to use the divisor:
(doc/"p.posted")
#=> the same Hpricot::Elements collection as above
Finding Just One Element
If you're looking for a single element, the at method will return the first element matched by the expression. In this case, you'll get back the element itself rather than the Hpricot::Elements array.
doc.at("body")['onload']
The above code will find the body tag and give you back the onload attribute. This is the most common reason to use the element directly: when reading and writing HTML attributes.
Fetching the Contents of an Element
Just as with browser scripting, the inner_html property can be used to get the inner contents of an element.
(doc/"#elementID").inner_html
#=> "..contents.."
If your expression matches more than one element, you'll get back the contents of all the matched elements. So you may want to use first to be sure you get back only one.
(doc/"#elementID").first.inner_html
#=> "..contents.."
Fetching the HTML for an Element
If you want the HTML for the whole element (not just the contents), use to_html:
(doc/"#elementID").to_html
#=> the element's own tags together with its contents, as a string
Looping
All searches return a set of Hpricot::Elements. Go ahead and loop through them like you would an array.
(doc/"p/a/img").each do |img|
  puts img.attributes['class']
end
Continuing Searches
Searches can be continued from a collection of elements, in order to search deeper.
# find all paragraphs.
elements = doc.search("/html/body//p")
# continue the search by finding any images within those paragraphs.
(elements/"img")
#=> an Hpricot::Elements collection holding the img elements found
Searches can also be continued by searching within container elements.
# find all images within paragraphs.
doc.search("/html/body//p").each do |para|
  puts "== Found a paragraph =="
  pp para
  imgs = para.search("img")
  if imgs.any?
    puts "== Found #{imgs.length} images inside =="
  end
end
Of course, the most succinct ways to do the above are using CSS or XPath.
# the xpath version
(doc/"/html/body//p//img")
# the css version
(doc/"html > body > p img")
# ..or symbols work, too!
(doc/:html/:body/:p/:img)
Looping Edits
You may certainly edit objects from within your search loops. Then, when you spit out the HTML, the altered elements will show.
(doc/"span.entryPermalink").each do |span|
  span.attributes['class'] = 'newLinks'
end
puts doc
This changes all span.entryPermalink elements to span.newLinks. Keep in mind that there are often more convenient ways of doing this, such as the set method:
(doc/"span.entryPermalink").set(:class => 'newLinks')
Figuring Out Paths
Every element can tell you its unique path (either XPath or CSS) to get to the element from the root tag.
The css_path method:
doc.at("div > div:nth(1)").css_path #=> "div > div:nth(1)"
doc.at("#header").css_path #=> "#header"
Or, the xpath method:
doc.at("div > div:nth(1)").xpath #=> "/div/div:eq(1)"
doc.at("#header").xpath #=> "//div[@id='header']"
Hpricot Fixups
When loading HTML documents, you have a few settings that can make Hpricot more or less intense about how it gets involved.
:fixup_tags
Really, there are so many ways to clean up HTML and your intentions may be to keep the HTML as-is. So Hpricot's default behavior is to keep things flexible: it makes sure to open and close all the tags, but ignores any validation problems.
As of Hpricot 0.4, there's a new :fixup_tags option which will attempt to shift the document's tags to meet XHTML 1.0 Strict.
doc = open("index.html") { |f| Hpricot f, :fixup_tags => true }
This doesn't quite meet the XHTML 1.0 Strict standard; it just tries to follow the rules a bit better. For example, if Hpricot finds a paragraph in a link, it's going to move the paragraph below the link, or up and out of other elements where paragraphs don't belong.
If an unknown element is found, it is ignored. Again, :fixup_tags.
:xhtml_strict
So, let's go beyond just trying to fix the hierarchy. The :xhtml_strict option really tries to force the document to be an XHTML 1.0 Strict document, even at the cost of removing elements that get in the way.
doc = open("index.html") { |f| Hpricot f, :xhtml_strict => true }
What measures does :xhtml_strict take?
- Shift elements into their proper containers just like :fixup_tags.
- Remove unknown elements.
- Remove unknown attributes.
- Remove illegal content.
- Alter the doctype to XHTML 1.0 Strict.
Hpricot.XML()
The last option is the :xml option, which makes some slight variations on the standard mode. The main difference is that :xml mode won't try to output tags which are friendlier for browsers. For example, if an opening and closing br tag is found, XML mode won't try to turn that into an empty element.
XML mode also doesn't downcase the tags and attributes for you. So pay attention to case, friends.
The primary way to use Hpricot's XML mode is to call the Hpricot.XML method:
doc = open("http://redhanded.hobix.com/index.xml") do |f|
  Hpricot.XML(f)
end
Also, :fixup_tags is canceled out by the :xml option. This is because :fixup_tags makes assumptions based on how HTML is structured. Specifically, how tags are defined in the XHTML 1.0 DTD.
To restore the repository download the bundle whymirror-hpricot_-_2012-03-16_19-47-16.bundle and run:
git clone whymirror-hpricot_-_2012-03-16_19-47-16.bundle -b master
Source: https://github.com/whymirror/hpricot
Uploader: whymirror
Upload date: 2012-03-16
- Addeddate
- 2017-05-30 18:05:28
- Identifier
- github.com-whymirror-hpricot_-_2012-03-16_19-47-16
- Originalurl
- https://github.com/whymirror/hpricot
- Pushed_date
- 2012-03-16 19:47:16
- Scanner
- Internet Archive Python library 1.5.0
- Uploaded_with
- iagitup - v1.0
- Year
- 2012