Translate the quickstart, sql, and structured streaming docs into Korean (by Connect Foundation)

spark-korea · Jul 28, 2019 · 7ce64ad
1 parent 4a4429e
commit 7ce64ad
Showing 28 changed files with 1,277 additions and 3,553 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,3 +1,15 @@
+# OS generated files #
+######################
+.DS_Store
+.DS_Store?
+._*
+.Spotlight-V100
+.Trashes
+ehthumbs.db
+Thumbs.db
+
+# Jekyll generated files #
+######################
 _site/
 .sass-cache/
 .jekyll-cache/

diff --git a/docs/_plugins/include_example.rb b/docs/_plugins/include_example.rb
@@ -58,8 +58,7 @@ def render(context)
 
       rendered_code = Pygments.highlight(code, :lexer => @lang)
 
-      hint = "<div><small>Find full example code at " \
-        "\"examples/src/main/#{snippet_file}\" in the Spark repo.</small></div>"
+      hint = "<div><small>스파크 저장소의 \"examples/src/main/#{snippet_file}\"에서 전체 예제 코드를 볼 수 있습니다.</small></div>"
 
       rendered_code + hint
     end

diff --git a/docs/quick-start.md b/docs/quick-start.md
diff --git a/docs/sql-data-sources-avro.md b/docs/sql-data-sources-avro.md
diff --git a/docs/sql-data-sources-hive-tables.md b/docs/sql-data-sources-hive-tables.md
@@ -1,154 +1,89 @@
 ---
 layout: global
-title: Hive Tables
-displayTitle: Hive Tables
+title: Hive 테이블
+displayTitle: Hive 테이블
 ---
 
 * Table of contents
 {:toc}
 
-Spark SQL also supports reading and writing data stored in [Apache Hive](http://hive.apache.org/).
-However, since Hive has a large number of dependencies, these dependencies are not included in the
-default Spark distribution. If Hive dependencies can be found on the classpath, Spark will load them
-automatically. Note that these Hive dependencies must also be present on all of the worker nodes, as
-they will need access to the Hive serialization and deserialization libraries (SerDes) in order to
-access data stored in Hive.
-
-Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` (for security configuration),
-and `hdfs-site.xml` (for HDFS configuration) file in `conf/`.
-
-When working with Hive, one must instantiate `SparkSession` with Hive support, including
-connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions.
-Users who do not have an existing Hive deployment can still enable Hive support. When not configured
-by the `hive-site.xml`, the context automatically creates `metastore_db` in the current directory and
-creates a directory configured by `spark.sql.warehouse.dir`, which defaults to the directory
-`spark-warehouse` in the current directory that the Spark application is started. Note that
-the `hive.metastore.warehouse.dir` property in `hive-site.xml` is deprecated since Spark 2.0.0.
-Instead, use `spark.sql.warehouse.dir` to specify the default location of database in warehouse.
-You may need to grant write privilege to the user who starts the Spark application.
+스파크 SQL은 [Apache Hive](http://hive.apache.org/)에 저장된 데이터의 읽기/쓰기도 지원합니다. 하지만 Hive는 의존 라이브러리가 매우 많기 때문에, 기본 스파크 배포판에는 이 의존 라이브러리가 포함되어 있지 않습니다. Hive의 의존 라이브러리를 classpath에서 찾을 수 있으면, 스파크는 이를 자동으로 로드합니다. Hive에 저장된 데이터에 접근하려면 Hive 직렬화/역직렬화 라이브러리(SerDe)가 필요하므로, 이 Hive 의존 라이브러리는 모든 작업 노드에도 존재해야 합니다.
+
+Hive 관련 설정은 `conf/` 안에 `hive-site.xml`, `core-site.xml`(보안 설정용), `hdfs-site.xml`(HDFS 설정용) 파일을 넣어 주면 됩니다.
+
+Hive를 사용하려면 Hive 지원을 활성화한 `SparkSession` 객체를 생성해야 합니다. Hive 지원에는 영구(persistent) Hive 메타스토어와의 연결, Hive SerDe 지원, Hive 사용자 정의 함수가 포함됩니다. 기존 Hive 환경이 없는 사용자도 Hive 지원을 활성화할 수 있습니다. `hive-site.xml`로 설정하지 않은 경우, 현재 디렉토리에 `metastore_db`를 자동으로 생성하고 `spark.sql.warehouse.dir`에 설정된 디렉토리를 생성합니다. `spark.sql.warehouse.dir`의 기본값은 스파크 애플리케이션을 시작한 현재 디렉토리의 `spark-warehouse` 디렉토리입니다. `hive-site.xml`의 `hive.metastore.warehouse.dir` 속성은 스파크 2.0.0 버전부터 더 이상 권장되지 않으며(deprecated), 대신 warehouse에서 데이터베이스의 기본 위치를 명시하려면 `spark.sql.warehouse.dir`을 사용해야 합니다. 스파크 애플리케이션을 실행하는 사용자에게 쓰기 권한을 부여해야 할 수 있습니다.
 
 <div class="codetabs">
 
 <div data-lang="scala"  markdown="1">
 {% include_example spark_hive scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala %}
 </div>
 
-<div data-lang="java"  markdown="1">
-{% include_example spark_hive java/org/apache/spark/examples/sql/hive/JavaSparkHiveExample.java %}
-</div>
-
 <div data-lang="python"  markdown="1">
 {% include_example spark_hive python/sql/hive.py %}
 </div>
 
-<div data-lang="r"  markdown="1">
-
-When working with Hive one must instantiate `SparkSession` with Hive support. This
-adds support for finding tables in the MetaStore and writing queries using HiveQL.
-
-{% include_example spark_hive r/RSparkSQLExample.R %}
-
-</div>
 </div>
 
-### Specifying storage format for Hive tables
+### Hive 테이블의 저장 형식 지정하기
 
-When you create a Hive table, you need to define how this table should read/write data from/to file system,
-i.e. the "input format" and "output format". You also need to define how this table should deserialize the data
-to rows, or serialize rows to data, i.e. the "serde". The following options can be used to specify the storage
-format("serde", "input format", "output format"), e.g. `CREATE TABLE src(id int) USING hive OPTIONS(fileFormat 'parquet')`.
-By default, we will read the table files as plain text. Note that, Hive storage handler is not supported yet when
-creating table, you can create a table using storage handler at Hive side, and use Spark SQL to read it.
+Hive 테이블을 생성할 때는 이 테이블이 파일시스템에서 데이터를 읽고 쓰는 방식, 즉 "입력 형식"과 "출력 형식"을 정의해야 합니다. 또한, 이 테이블이 데이터를 로우로 역직렬화하거나 로우를 데이터로 직렬화하는 방식(serde)도 정의해야 합니다. 아래의 옵션("serde", "input format", "output format")을 사용하여 `CREATE TABLE src(id int) USING hive OPTIONS(fileFormat 'parquet')`와 같이 저장 형식을 명시할 수 있습니다. 기본적으로, 테이블 파일은 플레인 텍스트(plain text)로 읽어들입니다. 단, 테이블을 생성할 때 Hive의 스토리지 핸들러 기능은 아직 지원되지 않으므로, Hive에서 직접 스토리지 핸들러를 사용하여 테이블을 생성하고 스파크 SQL에서 읽어오는 방법을 사용할 수 있습니다.
 
 <table class="table">
-  <tr><th>Property Name</th><th>Meaning</th></tr>
+  <tr><th>속성 이름</th><th>의미</th></tr>
   <tr>
     <td><code>fileFormat</code></td>
-    <td>
-      A fileFormat is kind of a package of storage format specifications, including "serde", "input format" and
-      "output format". Currently we support 6 fileFormats: 'sequencefile', 'rcfile', 'orc', 'parquet', 'textfile' and 'avro'.
-    </td>
+    <td>fileFormat은 "serde", "input format", "output format"을 포함하는 저장 형식 명세의 묶음입니다. 현재 6가지의 fileFormat을 지원합니다: 'sequencefile', 'rcfile', 'orc', 'parquet', 'textfile', 'avro'</td>
   </tr>
-
   <tr>
     <td><code>inputFormat, outputFormat</code></td>
-    <td>
-      These 2 options specify the name of a corresponding `InputFormat` and `OutputFormat` class as a string literal,
-      e.g. `org.apache.hadoop.hive.ql.io.orc.OrcInputFormat`. These 2 options must be appeared in pair, and you can not
-      specify them if you already specified the `fileFormat` option.
-    </td>
+    <td>이 두 옵션은 대응하는 `InputFormat`과 `OutputFormat` 클래스의 이름을 문자열 리터럴로 지정합니다. 예: `org.apache.hadoop.hive.ql.io.orc.OrcInputFormat`. 이 두 옵션은 반드시 한 쌍으로 함께 지정해야 하며, `fileFormat` 옵션을 이미 지정했다면 이 옵션은 사용할 수 없습니다.</td>
   </tr>
-
   <tr>
     <td><code>serde</code></td>
-    <td>
-      This option specifies the name of a serde class. When the `fileFormat` option is specified, do not specify this option
-      if the given `fileFormat` already include the information of serde. Currently "sequencefile", "textfile" and "rcfile"
-      don't include the serde information and you can use this option with these 3 fileFormats.
-    </td>
+    <td>serde 클래스의 이름을 명시합니다. `fileFormat` 옵션이 이미 명시되어 있고 여기에 serde에 대한 정보가 포함되어 있다면 이 옵션을 사용할 수 없습니다. 현재, 6가지의 fileFormat 옵션 중 "sequencefile", "textfile", "rcfile" 세 가지 옵션은 serde에 대한 정보를 포함하지 않으므로, fileFormat에서 이 세 가지 옵션을 사용하고 있을 때는 이 옵션을 사용할 수 있습니다.</td>
   </tr>
-
   <tr>
     <td><code>fieldDelim, escapeDelim, collectionDelim, mapkeyDelim, lineDelim</code></td>
-    <td>
-      These options can only be used with "textfile" fileFormat. They define how to read delimited files into rows.
-    </td>
+    <td>fileFormat 옵션으로 "textfile"이 지정되어 있을 때만 사용 가능합니다. 구분자로 구분된 파일(delimited file)을 로우로 읽어들이는 방법을 정의합니다.</td>
   </tr>
 </table>
 
-All other properties defined with `OPTIONS` will be regarded as Hive serde properties.
+`OPTIONS` 구문으로 정의되는 다른 모든 속성은 Hive serde 속성으로 간주됩니다.
+
+### 서로 다른 버전의 Hive 메타스토어와 연동하기
 
-### Interacting with Different Versions of Hive Metastore
+스파크 SQL의 Hive 지원에서 가장 중요한 부분 중 하나는, 스파크 SQL이 Hive 테이블의 메타데이터에 접근할 수 있도록 하는 Hive 메타스토어와의 연동 기능입니다. 스파크 1.4.0 버전부터, 아래에 설명된 설정을 사용하면, 단일 스파크 SQL 빌드에서 서로 다른 버전의 Hive 메타스토어에 쿼리를 실행할 수 있습니다. 연동하는 메타스토어 Hive의 버전과는 별개로, 스파크 SQL은 Hive 1.2.1 버전을 기준으로 컴파일되며 이 버전에 포함된 클래스(serde, UDF, UDAF 등)를 내부적으로 사용합니다.
 
-One of the most important pieces of Spark SQL's Hive support is interaction with Hive metastore,
-which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.4.0, a single binary
-build of Spark SQL can be used to query different versions of Hive metastores, using the configuration described below.
-Note that independent of the version of Hive that is being used to talk to the metastore, internally Spark SQL
-will compile against Hive 1.2.1 and use those classes for internal execution (serdes, UDFs, UDAFs, etc).
+아래의 옵션을 사용하여 메타데이터를 받아올 때 사용되는 Hive 버전을 설정할 수 있습니다:
 
-The following options can be used to configure the version of Hive that is used to retrieve metadata:
 
 <table class="table">
-  <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+  <tr><th>속성 이름</th><th>기본값</th><th>의미</th></tr>
   <tr>
     <td><code>spark.sql.hive.metastore.version</code></td>
     <td><code>1.2.1</code></td>
-    <td>
-      Version of the Hive metastore. Available
-      options are <code>0.12.0</code> through <code>2.3.3</code>.
-    </td>
+    <td>Hive 메타스토어의 버전. <code>0.12.0</code> 버전부터 <code>2.3.3</code> 버전까지 사용할 수 있습니다.</td>
   </tr>
   <tr>
     <td><code>spark.sql.hive.metastore.jars</code></td>
     <td><code>builtin</code></td>
     <td>
-      Location of the jars that should be used to instantiate the HiveMetastoreClient. This
-      property can be one of three options:
+      Hive 메타스토어에 연결할 때 사용하는 HiveMetastoreClient 객체를 생성하는데 사용될 jar 파일의 위치. 다음 세 가지 옵션이 사용 가능합니다:
       <ol>
         <li><code>builtin</code></li>
-        Use Hive 1.2.1, which is bundled with the Spark assembly when <code>-Phive</code> is
-        enabled. When this option is chosen, <code>spark.sql.hive.metastore.version</code> must be
-        either <code>1.2.1</code> or not defined.
+        <code>-Phive</code>가 활성화되어 있을 때 스파크 어셈블리에 함께 포함되는 Hive 1.2.1을 사용합니다. 이 옵션을 사용하면 <code>spark.sql.hive.metastore.version</code>은 <code>1.2.1</code>이거나 정의되지 않은 상태여야 합니다.
         <li><code>maven</code></li>
-        Use Hive jars of specified version downloaded from Maven repositories. This configuration
-        is not generally recommended for production deployments.
-        <li>A classpath in the standard format for the JVM. This classpath must include all of Hive
-        and its dependencies, including the correct version of Hadoop. These jars only need to be
-        present on the driver, but if you are running in yarn cluster mode then you must ensure
-        they are packaged with your application.</li>
+        Maven 저장소에서 명시된 버전의 Hive jar를 다운로드하여 사용합니다. 이 설정을 운영 환경(production environment)에서 사용하는 것은 추천하지 않습니다.
+        <li>JVM의 표준 형식 classpath. 이 classpath는 Hive와 올바른 버전의 Hadoop을 포함한 모든 의존 라이브러리를 포함해야 합니다. 이 jar 파일은 드라이버에만 있으면 되지만, yarn 클러스터 모드에서 실행한다면 애플리케이션과 함께 패키징되어 있어야 합니다.</li>
       </ol>
     </td>
   </tr>
   <tr>
     <td><code>spark.sql.hive.metastore.sharedPrefixes</code></td>
     <td><code>com.mysql.jdbc,<br/>org.postgresql,<br/>com.microsoft.sqlserver,<br/>oracle.jdbc</code></td>
     <td>
-      <p>
-        A comma-separated list of class prefixes that should be loaded using the classloader that is
-        shared between Spark SQL and a specific version of Hive. An example of classes that should
-        be shared is JDBC drivers that are needed to talk to the metastore. Other classes that need
-        to be shared are those that interact with classes that are already shared. For example,
-        custom appenders that are used by log4j.
+      <p>스파크 SQL과 (특정 버전의) Hive 사이에 공유되는 classloader를 사용하여 로드해야 하는 클래스들의 접두사 목록(쉼표로 구분). 예를 들면, Hive 메타스토어와 연결하는 데 사용되는 JDBC 드라이버가 공유되어야 하는 클래스에 해당합니다. (역자 주: Hive의 메타스토어로는 MySQL, PostgreSQL 등의 데이터베이스를 사용할 수 있습니다. MySQL에 연결하려면 <code>com.mysql.jdbc</code> 패키지의 클래스가, PostgreSQL에 연결하려면 <code>org.postgresql</code> 패키지의 클래스가 필요합니다. 각각의 경우 <code>com.mysql.jdbc</code>, <code>org.postgresql</code>가 지정되어야 합니다.) 이미 공유되고 있는 클래스와 상호작용하는 클래스의 접두사들 역시 명시되어야 합니다. (예: log4j에서 사용하는 사용자 정의 Appender)
       </p>
     </td>
   </tr>
@@ -157,9 +92,7 @@ The following options can be used to configure the version of Hive that is used
     <td><code>(empty)</code></td>
     <td>
       <p>
-        A comma separated list of class prefixes that should explicitly be reloaded for each version
-        of Hive that Spark SQL is communicating with. For example, Hive UDFs that are declared in a
-        prefix that typically would be shared (i.e. <code>org.apache.spark.*</code>).
+        스파크 SQL이 통신하는 Hive 버전마다 명시적으로 다시 로드해야 하는 클래스 접두사 목록(쉼표로 구분). 예를 들어, 보통은 공유되는 접두사(예: <code>org.apache.spark.*</code>) 아래에 선언된 Hive UDF가 여기에 해당합니다.
       </p>
     </td>
   </tr>
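
Note on the storage-format options in `docs/sql-data-sources-hive-tables.md`: the translated table states several rules about which `OPTIONS` keys may be combined. The sketch below restates those rules as a small check, which may help when reviewing the wording of the table. It is only an illustration. `check_storage_options` and `render_create_table` are hypothetical helpers, not Spark APIs. The sketch assumes option values are matched exactly as written and reads "only with textfile" strictly, so delimiter options without an explicit `fileFormat` are flagged.

```python
# Hypothetical helpers; they are not part of Spark and only mirror the
# rules stated in the storage-format table of this document.

FILE_FORMATS = {"sequencefile", "rcfile", "orc", "parquet", "textfile", "avro"}
# Formats that, per the table, do not carry serde information themselves.
SERDE_FREE_FORMATS = {"sequencefile", "textfile", "rcfile"}
DELIMITER_OPTIONS = {
    "fieldDelim", "escapeDelim", "collectionDelim", "mapkeyDelim", "lineDelim",
}


def check_storage_options(options):
    """Return a list of rule violations for a dict of OPTIONS (empty if none)."""
    problems = []
    file_format = options.get("fileFormat")
    has_input = "inputFormat" in options
    has_output = "outputFormat" in options

    if file_format is not None and file_format not in FILE_FORMATS:
        problems.append(f"unknown fileFormat: {file_format}")
    if has_input != has_output:
        problems.append("inputFormat and outputFormat must be given as a pair")
    if file_format is not None and (has_input or has_output):
        problems.append("inputFormat/outputFormat cannot be combined with fileFormat")
    if ("serde" in options and file_format is not None
            and file_format not in SERDE_FREE_FORMATS):
        problems.append(f"fileFormat '{file_format}' already implies a serde")
    used_delims = sorted(DELIMITER_OPTIONS.intersection(options))
    if used_delims and file_format != "textfile":
        problems.append(f"{', '.join(used_delims)} require fileFormat 'textfile'")
    return problems


def render_create_table(table, columns, options):
    """Build a CREATE TABLE ... USING hive statement (no quote escaping)."""
    rendered = ", ".join(f"{key} '{value}'" for key, value in options.items())
    return f"CREATE TABLE {table}({columns}) USING hive OPTIONS({rendered})"


if __name__ == "__main__":
    opts = {"fileFormat": "parquet"}
    print(check_storage_options(opts))
    print(render_create_table("src", "id int", opts))
```

With `{"fileFormat": "parquet"}` the check reports no violations and the rendered statement matches the example quoted in the doc, `CREATE TABLE src(id int) USING hive OPTIONS(fileFormat 'parquet')`.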