Parsing an HTML document is one of the most useful features for a web crawler when you only need to crawl the content of a specific element on a web page. Around the Go 1.4 release, an HTML parser became available at golang.org/x/net/html. In this post, I will show how to use the "html" library to parse an HTML document.
The package lives outside the standard library, and my computer, which is still running Go 1.3.1, does not have it. So I have to install it manually from the command line:
go get golang.org/x/net/html
Now, let's get a parsed HTML object in Go.
- Read the file. Depending on the source, there are many different ways to do this. In this post, I will use HTML from the Internet, so the first step is to download the page.
The download code fetches the HTML at the given URL and returns its content as a byte slice. The expression string(contents)
converts the byte slice to a string. If an error occurs while downloading the web content, the error is logged and nil is returned.
- Turn the byte slice into a *Reader. The parser's html.Parse function takes an io.Reader, so the byte slice has to be wrapped in a reader first.
With the reader in hand, we have an object that is ready for parsing.
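One way to do this conversion, assuming the standard library's bytes.NewReader is what was meant, is sketched below. The helper name toReader and the sample HTML are hypothetical and only there for illustration.

```go
package main

import (
	"bytes"
	"fmt"
)

// toReader (hypothetical helper) wraps a byte slice in a *bytes.Reader,
// which satisfies io.Reader and can be handed to a parser.
func toReader(contents []byte) *bytes.Reader {
	return bytes.NewReader(contents)
}

func main() {
	// Assumption: contents would normally come from the download step.
	contents := []byte("<html><body><p>hello</p></body></html>")
	reader := toReader(contents)

	// Len reports how many bytes are still unread.
	fmt.Println("bytes available:", reader.Len())
}
```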
- Parse the document into an *html.Node by passing the reader to html.Parse.
Now we have a parsed object, doc.