Unable to parse HTML #78

harshzalavadiya · 2020-07-27T14:34:48Z

I was trying to parse content from multiple document formats. It works for other formats (PDF, DOC, etc.) but not for HTML files.

Below is a minimal example with a sample HTML file.

main.go

package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv"
)

func main() {
	// Attempt to read file
	txt, err := docconv.ConvertPath("test.html")
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(txt.Body)
}

test.html

<!DOCTYPE html>
<html>
  <body>
    <h1>This is heading 1</h1>
    <h2>This is heading 2</h2>
    <h3>This is heading 3</h3>
    <h4>This is heading 4</h4>
    <h5>This is heading 5</h5>
    <h6>This is heading 6</h6>
  </body>
</html>

Currently the output is blank.

I also noticed that there has been no release since February 2019, so code.sajari.com might be serving an older version of the library. Is there any way to publish a pre-release version, or to configure CI to do that?


stuta · 2022-05-13T16:16:04Z

I have the same problem on Ubuntu x64 and on macOS (ARM, M1 Mac). No errors, but no meta info or content either.

jespino mentioned this issue Jul 28, 2022

use as the default "good" and "neargood" for html when ReadabilityUseClasses is empty #115

Closed
