Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dealing with bad html #32

Open
kouk opened this issue Oct 14, 2015 · 9 comments
Open

Dealing with bad html #32

kouk opened this issue Oct 14, 2015 · 9 comments

Comments

@kouk
Copy link

kouk commented Oct 14, 2015

Depending on the source markup there are times when the trumpet pipeline will not finish processing even though the response has been written entirely (see below). It would be great if there was a way to set a timeout and cancel the pipeline somehow. For example:

var http = require('http'),
    connect = require('connect'),
    through = require('through'),
    httpProxy = require('http-proxy');


var selects = [];
var simpleselect = {};

simpleselect.query = 'body';
simpleselect.func = function (node) {
    var ts = node.createStream();
    ts.pipe(through(null, function(){
        console.log("Injecting");
        this.queue("FOOBAR");
        this.queue(null);
    })).pipe(ts);

    ts.on('end', function() {
     console.log("BODY");
    });
}

selects.push(simpleselect);

var app = connect();

var proxy = httpProxy.createProxyServer({
    target: 'http://localhost:8002'
})

app.use(require('harmon')([], selects, true));

app.use(
  function (req, res) {
    proxy.web(req, res);
  }
);

http.createServer(app).listen(8001);

http.createServer(function(req, res) {
    res.setHeader('Content-Type', 'text/html; charset=UTF-8');
    res.setHeader('Transfer-Encoding', 'chunked');
    res.write('<html><head><title>foo</title></head><body><h1>Foo</h1>');
    /* With this line the request hangs
    res.write('<div><svg><defs><circle /></defs><circle /></svg></div>');
    */
    res.end('</body></html>');
}).listen(8002);

If you run the example and curl http://localhost:8001 it works fine. But if you uncomment the commented line and try again it hangs. I've tracked down the specific underlying reason for this in the html-select module, and I'll raise this specific issue with them. But what I'm interested in is shielding myself from these kinds of bugs (which may be triggered by broken proxied web pages) by doing setTimeout and aborting the trumpet pipeline somehow. I've tried various combinations of res.end() and tr.end() but the stream from tr.createStream just never seems to die. I would appreciate if you could show me a way, perhaps even a totally different approach.

@kouk
Copy link
Author

kouk commented Oct 14, 2015

There's also a discussion here which might be related: https://github.com/substack/html-tokenize/issues/13

But all this isn't really future proof, it would be great if somehow we can call ts.end() or node.abort() and have it kill all the streams that might be blocking.

@mspl13
Copy link

mspl13 commented Oct 18, 2015

I'm running in the same issue using this gist. If you uncomment line 45 and comment 44 out, it won't stop loading and you can't porceed with anything.
Couldn't find any solution to this point and can't really think of any way to stop it from loading...

@No9
Copy link
Owner

No9 commented Oct 19, 2015

From the notes here I agree this feels like an upstream fix.
But thanks for the steps to reproduce.

@mspl13
Copy link

mspl13 commented Oct 20, 2015

Sorry for seeming impatient (I'm not, just curious) but do you already know what you gonna do about it? Reporting this as a bug to the actual module that is affected or fix it by yourself?
Since you added the 'help-wanted' tag, do you wait for somebody you know to help you out with this?

@No9
Copy link
Owner

No9 commented Oct 20, 2015

Hi @mspl13 I am not actively looking at this at the moment.
Hence the help wanted tag.
I would love somebody to help with this if they have the bandwidth.

@chandransuresh
Copy link

chandransuresh commented May 23, 2018

Hi @No9
I faced issue with bad HTML as well ( "end" event does not get fired for bad HTML) , I ended up adding this code to fix it:
Would you let me create a PR with a change similar to the following? (need to get the Max timeout from configuration). Thanks.

let endEmittedInterval;

let endEmittedInterval;
tr.on('finish', () => {
            let intervalCount = 0;
            endEmittedInterval = setInterval(() => {
                try {
                    intervalCount += 1;
                    // make sure read and write are completed after finish
                    if (tr._writableState.ended === true && tr._tokenize._readableState.endEmitted === true) {
                        clearInterval(endEmittedInterval);
                        _end.call(res);
                    } else if (intervalCount >= 6000) {
                        // clear the runway interval after 60 seconds
                        // TODO: change 60 seconds to server's MAXIMUM_TIMEOUT value
                        clearInterval(endEmittedInterval);
                        _end.call(res);
                    }
                } catch (err) {
                    clearInterval(endEmittedInterval);
                    _end.call(res);
                }
            }, 10);
        });

@No9
Copy link
Owner

No9 commented May 23, 2018

Hi @chandransuresh
Thanks for your message.

Do you know if this patch to html-select fix your issue?
https://github.com/substack/html-select/pull/23

If it does we might want to fork that along with trumpet and publishing them in there own namespace?

If not then we can consider your workaround above and putting it into the harmon constructor
something like

var opts = {};
opts.timeout = 6000;
app.use(require('../')([], selects, opts));

Let me know how you get on

@chandransuresh
Copy link

chandransuresh commented May 29, 2018

Hi @No9 , Thanks for the response.
I tested with the change in kouk:master branch from substack/html-select#23. No, it does not fix the issue of broken HTML. In my sample page, there are bunch of tags that does not have any matching close tags.

@chandransuresh
Copy link

Hi @No9 ,
Sorry for the delay in getting back. Could you give me access to create a PR for malformed HTML fix?.
The changed file is in the following clone.
https://github.com/chandransuresh/harmon-malformed-html-fix/blob/fix-for-malformed-html/index.js

Thanks,
Suresh

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants