
A Simple Tutorial on the Go HTML Parser

Posted by Yan on February 24, 2015

Parsing an HTML document is one of the most useful features for a web crawler when you only need the content of a specific element on a page. Since around the Go 1.4 release, an HTML parser has been available in the golang.org/x/net/html package. In this post, I will show how to use the "html" library to parse an HTML document.


This package is not part of the standard library, and my computer is still running Go 1.3.1, so I had to install it manually from the command line:

go get golang.org/x/net/html

Now, let's get a parsed HTML object in Go.

  1. Read the file. There are many ways to do this, depending on the source. In this post, I will use HTML fetched from the Internet, so the first step is to download the page:
package htmlparser

import (
	"log"
	"net/http"
	"io/ioutil"
)


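// DownloadByUrl fetches the page at url and returns its body as a byte slice.
// It logs the error and returns nil if the request or the read fails.
func DownloadByUrl(url string) []byte {
	log.Println("download started, url:", url)
	response, err := http.Get(url)
	if err != nil {
		log.Printf("%s", err)
		return nil
	}
	// Always close the body once we are done reading it.
	defer response.Body.Close()

	contents, err := ioutil.ReadAll(response.Body)
	if err != nil {
		log.Printf("%s", err)
		return nil
	}
	log.Println("download finished, url:", url)
	log.Printf("%s\n", string(contents))
	return contents
}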

This code downloads the HTML at the given URL and returns its content as a byte slice. The string(contents) expression converts the byte slice to a string for logging. If an error occurs while downloading the page, the error is logged and nil is returned.

  2. Turn the byte slice into a *bytes.Reader. The parser reads from an io.Reader, so wrap the downloaded bytes with bytes.NewReader:
	import "bytes"

	...
	reader := bytes.NewReader(/*your byte slice here*/)
	...

Now we have a reader that can be passed to the parser.

  3. Parse the document into a *html.Node:
	import "golang.org/x/net/html"

	...
	doc, err := html.Parse(reader)
	if err != nil {
		log.Fatal(err)
	}
	...

Now we have the parsed document in doc, which is the root *html.Node of the tree.