ebiiii edited this page Oct 27, 2014 · 1 revision

Data Clean Virtual Sensor

This processing class runs the GSN data cleaning module on an input stream and produces four additional output streams: processed, dirtiness, distance and quality.

  • Processed: the value regenerated by the selected model (constant, linear, quadratic, chebyschev_deg1, chebyschev_deg2, chebyschev_deg3 or arma_garch (*)).
  • Dirtiness: an indicator between 0 and 1 of whether the data point is considered dirty. It is based on the distance between the actual data value and the modeled value: if the distance is larger than the error bound, the data point is marked dirty (dirtiness=1).
  • Distance: the distance between the actual data value and the modeled value.
  • Quality: a quality metric available only with the ARMA GARCH model; the other models simply generate zeros. quality = 1 - (stream - processed) / (3 * sigma)

(*) The arma_garch model requires an R server to run. For details on setting up an R server, see http://rosuda.org/JRI/
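
The relationship between the output streams described above can be sketched in a few lines of Python. This is only an illustration, not the GSN implementation: the function names are hypothetical, the "constant" model is assumed to be the window mean, the distance is assumed to be an absolute difference, and sigma is simply passed in rather than estimated by an ARMA GARCH fit.

```python
from statistics import mean


def clean_window(window, error_bound):
    """Return one (stream, processed, dirtiness, distance) tuple per value."""
    # Assumption: a "constant" model is approximated by the window mean.
    processed = mean(window)
    rows = []
    for value in window:
        # Assumption: distance is the absolute difference to the modeled value.
        distance = abs(value - processed)
        # A point is marked dirty when its distance exceeds the error bound.
        dirtiness = 1.0 if distance > error_bound else 0.0
        rows.append((value, processed, dirtiness, distance))
    return rows


def quality(stream, processed, sigma):
    """Quality formula as written on this page; sigma is supplied by the caller."""
    return 1 - (stream - processed) / (3 * sigma)


# Five temperature readings; the last one is an outlier.
rows = clean_window([20.0, 21.0, 19.0, 20.0, 40.0], error_bound=5)
for row in rows:
    print(row)
```

With an error bound of 5, only the last reading lies further than 5 from the window mean, so it is the only one marked dirty.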

Sample code:

<virtual-sensor name="dataclean_example" priority="10">
    <processing-class>
        <class-name>gsn.vsensor.DataCleanVirtualSensor</class-name>
        <init-params>
            <param name="model">chebyschev_deg3</param> <!-- possible values: constant, linear, quadratic, chebyschev_deg1, chebyschev_deg2, chebyschev_deg3, arma_garch -->
            <param name="error_bound">5</param>
            <param name="window_size">100</param>
        </init-params>
        <output-structure>
            <field name="stream" type="double"/>
            <field name="processed" type="double"/>
            <field name="dirtiness" type="double"/>
            <field name="distance" type="double"/>
            <field name="quality" type="double"/>
        </output-structure>
    </processing-class>
    <description>This virtual sensor uses the data cleaning engine to mark dirty data</description>
    <life-cycle pool-size="10"/>
    <addressing>
        <predicate key="geographical">Sensor</predicate>
        <predicate key="LATITUDE">46.520000</predicate>
        <predicate key="LONGITUDE">6.565000</predicate>
    </addressing>
    <storage history-size="5m"/>
    <streams>
        <stream name="input1">
            <source alias="source1" sampling-rate="1" storage-size="1">
                <address wrapper="multiformat">
                    <predicate key="HOST">localhost</predicate>
                    <predicate key="PORT">22001</predicate>
                </address>
                <query>SELECT light, temperature, packet_type, timed FROM wrapper</query>
            </source>
            <query>SELECT temperature, timed FROM source1</query>
        </stream>
    </streams>
</virtual-sensor>
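
Before deploying a descriptor like the one above, it may help to check its init-params. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`. The `check_descriptor` helper is hypothetical and not part of GSN, and treating all three parameters as required, with numeric values for `error_bound` and `window_size`, is an assumption based on the sample.

```python
import xml.etree.ElementTree as ET

# Model names listed on this page.
ALLOWED_MODELS = {
    "constant", "linear", "quadratic",
    "chebyschev_deg1", "chebyschev_deg2", "chebyschev_deg3", "arma_garch",
}


def check_descriptor(xml_text):
    """Return (params, problems) for a virtual sensor descriptor string."""
    root = ET.fromstring(xml_text)
    params = {
        p.get("name"): (p.text or "").strip()
        for p in root.findall("./processing-class/init-params/param")
    }
    problems = []
    if params.get("model") not in ALLOWED_MODELS:
        problems.append("unknown or missing model")
    # Assumption: both numeric parameters are required.
    for key in ("error_bound", "window_size"):
        try:
            float(params[key])
        except (KeyError, ValueError):
            problems.append("missing or non-numeric " + key)
    return params, problems


# Trimmed-down descriptor holding only the part that is checked.
SAMPLE = """
<virtual-sensor name="dataclean_example" priority="10">
    <processing-class>
        <class-name>gsn.vsensor.DataCleanVirtualSensor</class-name>
        <init-params>
            <param name="model">chebyschev_deg3</param>
            <param name="error_bound">5</param>
            <param name="window_size">100</param>
        </init-params>
    </processing-class>
</virtual-sensor>
"""

params, problems = check_descriptor(SAMPLE)
print(params, problems)
```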