Commit

deploy: 0fdb97a
yxdyc committed Jul 17, 2024
1 parent 493acde commit 67a4f78
Showing 13 changed files with 123 additions and 104 deletions.
8 changes: 4 additions & 4 deletions _modules/data_juicer/analysis/column_wise_analysis.html
@@ -146,7 +146,7 @@ <h1>Source code for data_juicer.analysis.column_wise_analysis</h1><div class="hi
<span class="w"> </span><span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> Initialization method</span>

<span class="sd"> :param dataset: the dataset to be analysed</span>
<span class="sd"> :param dataset: the dataset to be analyzed</span>
<span class="sd"> :param output_path: path to store the analysis results</span>
<span class="sd"> :param overall_result: optional precomputed overall stats result</span>
<span class="sd"> :param save_stats_in_one_file: whether save all analysis figures of all</span>
@@ -157,15 +157,15 @@ <h1>Source code for data_juicer.analysis.column_wise_analysis</h1><div class="hi
<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">output_path</span><span class="p">):</span>
<span class="n">os</span><span class="o">.</span><span class="n">makedirs</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">output_path</span><span class="p">)</span>

<span class="c1"># if no overall description provided, analyse it from scratch</span>
<span class="c1"># if no overall description provided, analyze it from scratch</span>
<span class="k">if</span> <span class="n">overall_result</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
<span class="n">oa</span> <span class="o">=</span> <span class="n">OverallAnalysis</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="n">output_path</span><span class="p">)</span>
<span class="n">overall_result</span> <span class="o">=</span> <span class="n">oa</span><span class="o">.</span><span class="n">analyse</span><span class="p">()</span>
<span class="n">overall_result</span> <span class="o">=</span> <span class="n">oa</span><span class="o">.</span><span class="n">analyze</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">overall_result</span> <span class="o">=</span> <span class="n">overall_result</span>

<span class="bp">self</span><span class="o">.</span><span class="n">save_stats_in_one_file</span> <span class="o">=</span> <span class="n">save_stats_in_one_file</span></div>

<div class="viewcode-block" id="ColumnWiseAnalysis.analyse"><a class="viewcode-back" href="../../../data_juicer.analysis.html#data_juicer.analysis.ColumnWiseAnalysis.analyse">[docs]</a> <span class="k">def</span> <span class="nf">analyse</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">show_percentiles</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">show</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">skip_export</span><span class="o">=</span><span class="kc">False</span><span class="p">):</span>
<div class="viewcode-block" id="ColumnWiseAnalysis.analyze"><a class="viewcode-back" href="../../../data_juicer.analysis.html#data_juicer.analysis.ColumnWiseAnalysis.analyze">[docs]</a> <span class="k">def</span> <span class="nf">analyze</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">show_percentiles</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">show</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">skip_export</span><span class="o">=</span><span class="kc">False</span><span class="p">):</span>
<span class="w"> </span><span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> Apply analysis and draw the analysis figure for stats.</span>

16 changes: 8 additions & 8 deletions _modules/data_juicer/analysis/diversity_analysis.html
@@ -123,9 +123,9 @@ <h1>Source code for data_juicer.analysis.diversity_analysis</h1><div class="high
<span class="sd"> Find the verb and its object closest to the root of lexical tree of input</span>
<span class="sd"> string.</span>

<span class="sd"> :param nlp: the diversity model to analyse the diversity strings</span>
<span class="sd"> :param s: the string to be analysed</span>
<span class="sd"> :param first_sent: whether to analyse the first sentence in the</span>
<span class="sd"> :param nlp: the diversity model to analyze the diversity strings</span>
<span class="sd"> :param s: the string to be analyzed</span>
<span class="sd"> :param first_sent: whether to analyze the first sentence in the</span>
<span class="sd"> input string only. If it&#39;s true, return the analysis result of</span>
<span class="sd"> the first sentence no matter it&#39;s valid or not. If it&#39;s false,</span>
<span class="sd"> return the first valid result over all sentences</span>
@@ -171,7 +171,7 @@ <h1>Source code for data_juicer.analysis.diversity_analysis</h1><div class="high
<span class="sd"> result.&quot;&quot;&quot;</span>

<div class="viewcode-block" id="DiversityAnalysis.__init__"><a class="viewcode-back" href="../../../data_juicer.analysis.html#data_juicer.analysis.DiversityAnalysis.__init__">[docs]</a> <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">dataset</span><span class="p">,</span> <span class="n">output_path</span><span class="p">,</span> <span class="n">lang_or_model</span><span class="o">=</span><span class="s1">&#39;en&#39;</span><span class="p">):</span>
<span class="w"> </span><span class="sd">&quot;&quot;&quot;Initialization method :param dataset: the dataset to be analysed</span>
<span class="w"> </span><span class="sd">&quot;&quot;&quot;Initialization method :param dataset: the dataset to be analyzed</span>
<span class="sd"> :param output_path: path to store the analysis results :param</span>
<span class="sd"> lang_or_model: the diversity model or a specific language used to load</span>
<span class="sd"> the diversity model.&quot;&quot;&quot;</span>
@@ -188,7 +188,7 @@ <h1>Source code for data_juicer.analysis.diversity_analysis</h1><div class="high

<span class="sd"> :param lang_or_model: the diversity model or a specific language</span>
<span class="sd"> used to load the diversity model</span>
<span class="sd"> :param column_name: the name of column to be analysed</span>
<span class="sd"> :param column_name: the name of column to be analyzed</span>
<span class="sd"> :return: the analysis result.</span>
<span class="sd"> &quot;&quot;&quot;</span>
<span class="c1"># load diversity model</span>
@@ -213,7 +213,7 @@ <h1>Source code for data_juicer.analysis.diversity_analysis</h1><div class="high
<span class="n">dataset</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">dataset</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">find_verb_noun</span><span class="p">)</span>
<span class="k">return</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span></div>

<div class="viewcode-block" id="DiversityAnalysis.analyse"><a class="viewcode-back" href="../../../data_juicer.analysis.html#data_juicer.analysis.DiversityAnalysis.analyse">[docs]</a> <span class="k">def</span> <span class="nf">analyse</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span>
<div class="viewcode-block" id="DiversityAnalysis.analyze"><a class="viewcode-back" href="../../../data_juicer.analysis.html#data_juicer.analysis.DiversityAnalysis.analyze">[docs]</a> <span class="k">def</span> <span class="nf">analyze</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span>
<span class="n">lang_or_model</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span>
<span class="n">column_name</span><span class="o">=</span><span class="s1">&#39;text&#39;</span><span class="p">,</span>
<span class="n">postproc_func</span><span class="o">=</span><span class="n">get_diversity</span><span class="p">,</span>
@@ -223,8 +223,8 @@ <h1>Source code for data_juicer.analysis.diversity_analysis</h1><div class="high

<span class="sd"> :param lang_or_model: the diversity model or a specific language</span>
<span class="sd"> used to load the diversity model</span>
<span class="sd"> :param column_name: the name of column to be analysed</span>
<span class="sd"> :param postproc_func: function to analyse diversity. In default,</span>
<span class="sd"> :param column_name: the name of column to be analyzed</span>
<span class="sd"> :param postproc_func: function to analyze diversity. In default,</span>
<span class="sd"> it&#39;s function get_diversity</span>
<span class="sd"> :param postproc_kwarg: arguments of the postproc_func</span>
<span class="sd"> :return:</span>
14 changes: 7 additions & 7 deletions _modules/data_juicer/analysis/overall_analysis.html
@@ -105,17 +105,17 @@ <h1>Source code for data_juicer.analysis.overall_analysis</h1><div class="highli
<span class="w"> </span><span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> Initialization method.</span>

<span class="sd"> :param dataset: the dataset to be analysed</span>
<span class="sd"> :param dataset: the dataset to be analyzed</span>
<span class="sd"> :param output_path: path to store the analysis results.</span>
<span class="sd"> &quot;&quot;&quot;</span>
<span class="bp">self</span><span class="o">.</span><span class="n">stats</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">dataset</span><span class="p">[</span><span class="n">Fields</span><span class="o">.</span><span class="n">stats</span><span class="p">])</span>
<span class="bp">self</span><span class="o">.</span><span class="n">output_path</span> <span class="o">=</span> <span class="n">output_path</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">output_path</span><span class="p">):</span>
<span class="n">os</span><span class="o">.</span><span class="n">makedirs</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">output_path</span><span class="p">)</span>

<span class="c1"># default percentiles to analyse</span>
<span class="c1"># default percentiles to analyze</span>
<span class="bp">self</span><span class="o">.</span><span class="n">default_percentiles</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.75</span><span class="p">]</span>
<span class="c1"># supported dtypes of column to be analysed</span>
<span class="c1"># supported dtypes of column to be analyzed</span>
<span class="c1"># Notice: there won&#39;t be mixed types in a column because the stats is</span>
<span class="c1"># obtained from Dataset, which doesn&#39;t allow mixed types.</span>
<span class="c1"># Notice: for now, stats can only be:</span>
@@ -132,7 +132,7 @@ <h1>Source code for data_juicer.analysis.overall_analysis</h1><div class="highli
<span class="k">if</span> <span class="nb">type</span><span class="p">(</span><span class="n">first</span><span class="p">)</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">supported_object_types</span><span class="p">:</span>
<span class="n">logger</span><span class="o">.</span><span class="n">warning</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;There is a column of stats with type &#39;</span>
<span class="sa">f</span><span class="s1">&#39;[</span><span class="si">{</span><span class="nb">type</span><span class="p">(</span><span class="n">first</span><span class="p">)</span><span class="si">}</span><span class="s1">], which is not supported to be &#39;</span>
<span class="sa">f</span><span class="s1">&#39;analysed for now.&#39;</span><span class="p">)</span>
<span class="sa">f</span><span class="s1">&#39;analyzed for now.&#39;</span><span class="p">)</span>
<span class="k">return</span> <span class="kc">None</span>
<span class="k">if</span> <span class="nb">type</span><span class="p">(</span><span class="n">first</span><span class="p">)</span> <span class="ow">is</span> <span class="nb">str</span><span class="p">:</span>
<span class="c1"># describe(include = &#39;all&#39;) can analyze the string type</span>
@@ -142,13 +142,13 @@ <h1>Source code for data_juicer.analysis.overall_analysis</h1><div class="highli
<span class="n">col</span> <span class="o">=</span> <span class="n">col</span><span class="o">.</span><span class="n">explode</span><span class="p">()</span><span class="o">.</span><span class="n">infer_objects</span><span class="p">()</span>
<span class="k">return</span> <span class="n">col</span></div>

<div class="viewcode-block" id="OverallAnalysis.analyse"><a class="viewcode-back" href="../../../data_juicer.analysis.html#data_juicer.analysis.OverallAnalysis.analyse">[docs]</a> <span class="k">def</span> <span class="nf">analyse</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">percentiles</span><span class="o">=</span><span class="p">[],</span> <span class="n">num_proc</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">skip_export</span><span class="o">=</span><span class="kc">False</span><span class="p">):</span>
<div class="viewcode-block" id="OverallAnalysis.analyze"><a class="viewcode-back" href="../../../data_juicer.analysis.html#data_juicer.analysis.OverallAnalysis.analyze">[docs]</a> <span class="k">def</span> <span class="nf">analyze</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">percentiles</span><span class="o">=</span><span class="p">[],</span> <span class="n">num_proc</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">skip_export</span><span class="o">=</span><span class="kc">False</span><span class="p">):</span>
<span class="w"> </span><span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> Apply overall analysis on the whole dataset based on the describe</span>
<span class="sd"> method of pandas.</span>

<span class="sd"> :param percentiles: percentiles to analyse</span>
<span class="sd"> :param num_proc: number of processes to analyse the dataset</span>
<span class="sd"> :param percentiles: percentiles to analyze</span>
<span class="sd"> :param num_proc: number of processes to analyze the dataset</span>
<span class="sd"> :param skip_export: whether export the results to disk</span>
<span class="sd"> :return: the overall analysis result.</span>
<span class="sd"> &quot;&quot;&quot;</span>
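
The hunks above describe an overall analysis built on the pandas describe method, with default percentiles of 0.25, 0.5 and 0.75 and list-valued columns flattened via explode().infer_objects(). A minimal sketch of that idea follows. It is not the project's implementation: the helper name describe_column, the sample column names, and the sample values are all hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical stats table (illustrative values only): one numeric column
# and one column whose cells are lists of numbers.
stats = pd.DataFrame({
    "text_len": [10, 20, 30, 40],
    "word_lens": [[1, 2], [3], [4, 5], [6]],
})


def describe_column(col, percentiles=None):
    """Summarize one stats column, flattening list cells first.

    Hypothetical helper: it mirrors the explode().infer_objects() step and
    the default percentiles shown in the diff, nothing more.
    """
    if percentiles is None:
        # Same defaults as default_percentiles in the diff above.
        percentiles = [0.25, 0.5, 0.75]
    if isinstance(col.iloc[0], list):
        # One row per list element; then let pandas infer a numeric dtype.
        col = col.explode().infer_objects()
    return col.describe(percentiles=percentiles)


overall = {name: describe_column(stats[name]) for name in stats.columns}
print(pd.DataFrame(overall))
```

The sketch uses None as the default for percentiles and fills in the list inside the function, which avoids sharing one mutable default list across calls.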

0 comments on commit 67a4f78
