
Algolia integration #7

Draft · wants to merge 3 commits into master
Conversation

@gabfr (Owner) commented Oct 31, 2019

Working on getting the Algolia integration running as smoothly as possible :)

  • First of all, I made an adaptation to sync all jobs in the database

But before syncing all jobs, we may need a way to "estimate" the published_at of the jobs from angel.co.

  • Idea: theoretically, the jobs are ordered from newest to oldest in the angel.co search, so for each "scroll" action in Selenium we could subtract one week (?) from now
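The idea above could be sketched roughly like this (a hypothetical helper, assuming jobs really do appear newest-first and that each scroll page covers about one week):

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the idea above: jobs are assumed to appear
# newest-first in the angel.co search, so each Selenium "scroll" page
# is treated as roughly one week older than the previous one.
def estimate_published_at(scroll_index, now=None):
    """Rough published_at estimate for jobs seen on scroll page `scroll_index`."""
    now = now or datetime.utcnow()
    return now - timedelta(weeks=scroll_index)
```

The one-week step is a guess; it could be tuned once we compare a few estimated dates against jobs whose real publish date is known.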

gabfr commented Oct 31, 2019

Currently I'm having an issue when fetching rows with psycopg2 and saving them with Algolia's save_objects:

*** Reading local file: /usr/local/airflow/logs/algoliasearch_index_jobs_dag/index_jobs_task/2019-10-24T00:00:00+00:00/10.log
[2019-10-31 19:17:13,602] {{taskinstance.py:616}} INFO - Dependencies all met for <TaskInstance: algoliasearch_index_jobs_dag.index_jobs_task 2019-10-24T00:00:00+00:00 [queued]>
[2019-10-31 19:17:13,627] {{taskinstance.py:616}} INFO - Dependencies all met for <TaskInstance: algoliasearch_index_jobs_dag.index_jobs_task 2019-10-24T00:00:00+00:00 [queued]>
[2019-10-31 19:17:13,628] {{taskinstance.py:834}} INFO - 
--------------------------------------------------------------------------------
[2019-10-31 19:17:13,628] {{taskinstance.py:835}} INFO - Starting attempt 10 of 11
[2019-10-31 19:17:13,628] {{taskinstance.py:836}} INFO - 
--------------------------------------------------------------------------------
[2019-10-31 19:17:13,637] {{taskinstance.py:855}} INFO - Executing <Task(PythonOperator): index_jobs_task> on 2019-10-24T00:00:00+00:00
[2019-10-31 19:17:13,637] {{base_task_runner.py:133}} INFO - Running: ['airflow', 'run', 'algoliasearch_index_jobs_dag', 'index_jobs_task', '2019-10-24T00:00:00+00:00', '--job_id', '692', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/algoliasearch_index_jobs_dag.py', '--cfg_path', '/tmp/tmpib19cwz3']
[2019-10-31 19:17:14,587] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task [2019-10-31 19:17:14,586] {{settings.py:213}} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=79254
[2019-10-31 19:17:14,611] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task /usr/local/lib/python3.7/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
[2019-10-31 19:17:14,611] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task   """)
[2019-10-31 19:17:15,199] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task [2019-10-31 19:17:15,198] {{__init__.py:51}} INFO - Using executor LocalExecutor
[2019-10-31 19:17:15,606] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task [2019-10-31 19:17:15,605] {{dagbag.py:90}} INFO - Filling up the DagBag from /usr/local/airflow/dags/algoliasearch_index_jobs_dag.py
[2019-10-31 19:17:15,667] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task [2019-10-31 19:17:15,667] {{cli.py:516}} INFO - Running <TaskInstance: algoliasearch_index_jobs_dag.index_jobs_task 2019-10-24T00:00:00+00:00 [running]> on host af1bd95ab17b
[2019-10-31 19:17:15,693] {{python_operator.py:105}} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_ID=algoliasearch_index_jobs_dag
AIRFLOW_CTX_TASK_ID=index_jobs_task
AIRFLOW_CTX_EXECUTION_DATE=2019-10-24T00:00:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=scheduled__2019-10-24T00:00:00+00:00
[2019-10-31 19:17:15,707] {{logging_mixin.py:95}} INFO - [2019-10-31 19:17:15,707] {{base_hook.py:84}} INFO - Using connection to: id: pgsql. Host: postgres, Port: 5432, Schema: waw, Login: airflow, Password: XXXXXXXX, extra: {}
[2019-10-31 19:17:18,619] {{taskinstance.py:1047}} ERROR - Record at the position 930 objectID=1c26fb627cb6c11a89d51c1e1f1485a2 is too big size=16595 bytes. Contact us if you need an extended quota
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 922, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 113, in execute
    return_value = self.execute_callable()
  File "/usr/local/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/usr/local/airflow/dags/algoliasearch_index_jobs_dag.py", line 50, in index_jobs
    index.save_objects(rows)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/algoliasearch/search_index.py", line 86, in save_objects
    request_options)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/algoliasearch/search_index.py", line 563, in _chunk
    self._raw_batch(requests, request_options))
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/algoliasearch/search_index.py", line 583, in _raw_batch
    request_options
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/algoliasearch/http/transporter.py", line 42, in write
    return self.request(verb, hosts, path, data, request_options, timeout)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/algoliasearch/http/transporter.py", line 75, in request
    return self.retry(hosts, request, relative_url)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/algoliasearch/http/transporter.py", line 96, in retry
    raise RequestException(content, response.status_code)
algoliasearch.exceptions.RequestException: Record at the position 930 objectID=1c26fb627cb6c11a89d51c1e1f1485a2 is too big size=16595 bytes. Contact us if you need an extended quota
[2019-10-31 19:17:18,628] {{taskinstance.py:1070}} INFO - Marking task as UP_FOR_RETRY
[2019-10-31 19:17:18,655] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task Traceback (most recent call last):
[2019-10-31 19:17:18,656] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task   File "/usr/local/bin/airflow", line 32, in <module>
[2019-10-31 19:17:18,656] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task     args.func(args)
[2019-10-31 19:17:18,656] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task   File "/usr/local/lib/python3.7/site-packages/airflow/utils/cli.py", line 74, in wrapper
[2019-10-31 19:17:18,656] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task     return f(*args, **kwargs)
[2019-10-31 19:17:18,656] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task   File "/usr/local/lib/python3.7/site-packages/airflow/bin/cli.py", line 522, in run
[2019-10-31 19:17:18,656] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task     _run(args, dag, ti)
[2019-10-31 19:17:18,656] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task   File "/usr/local/lib/python3.7/site-packages/airflow/bin/cli.py", line 440, in _run
[2019-10-31 19:17:18,656] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task     pool=args.pool,
[2019-10-31 19:17:18,656] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task   File "/usr/local/lib/python3.7/site-packages/airflow/utils/db.py", line 74, in wrapper
[2019-10-31 19:17:18,656] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task     return func(*args, **kwargs)
[2019-10-31 19:17:18,657] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task   File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 922, in _run_raw_task
[2019-10-31 19:17:18,657] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task     result = task_copy.execute(context=context)
[2019-10-31 19:17:18,657] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task   File "/usr/local/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 113, in execute
[2019-10-31 19:17:18,657] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task     return_value = self.execute_callable()
[2019-10-31 19:17:18,657] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task   File "/usr/local/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
[2019-10-31 19:17:18,657] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task     return self.python_callable(*self.op_args, **self.op_kwargs)
[2019-10-31 19:17:18,657] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task   File "/usr/local/airflow/dags/algoliasearch_index_jobs_dag.py", line 50, in index_jobs
[2019-10-31 19:17:18,657] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task     index.save_objects(rows)
[2019-10-31 19:17:18,657] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task   File "/usr/local/airflow/.local/lib/python3.7/site-packages/algoliasearch/search_index.py", line 86, in save_objects
[2019-10-31 19:17:18,657] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task     request_options)
[2019-10-31 19:17:18,657] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task   File "/usr/local/airflow/.local/lib/python3.7/site-packages/algoliasearch/search_index.py", line 563, in _chunk
[2019-10-31 19:17:18,658] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task     self._raw_batch(requests, request_options))
[2019-10-31 19:17:18,658] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task   File "/usr/local/airflow/.local/lib/python3.7/site-packages/algoliasearch/search_index.py", line 583, in _raw_batch
[2019-10-31 19:17:18,658] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task     request_options
[2019-10-31 19:17:18,658] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task   File "/usr/local/airflow/.local/lib/python3.7/site-packages/algoliasearch/http/transporter.py", line 42, in write
[2019-10-31 19:17:18,658] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task     return self.request(verb, hosts, path, data, request_options, timeout)
[2019-10-31 19:17:18,658] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task   File "/usr/local/airflow/.local/lib/python3.7/site-packages/algoliasearch/http/transporter.py", line 75, in request
[2019-10-31 19:17:18,658] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task     return self.retry(hosts, request, relative_url)
[2019-10-31 19:17:18,658] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task   File "/usr/local/airflow/.local/lib/python3.7/site-packages/algoliasearch/http/transporter.py", line 96, in retry
[2019-10-31 19:17:18,658] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task     raise RequestException(content, response.status_code)
[2019-10-31 19:17:18,658] {{base_task_runner.py:115}} INFO - Job 692: Subtask index_jobs_task algoliasearch.exceptions.RequestException: Record at the position 930 objectID=1c26fb627cb6c11a89d51c1e1f1485a2 is too big size=16595 bytes. Contact us if you need an extended quota
[2019-10-31 19:17:23,574] {{logging_mixin.py:95}} INFO - [2019-10-31 19:17:23,573] {{local_task_job.py:105}} INFO - Task exited with return code 1
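The error points at a single oversized record (position 930, 16595 bytes). Before changing the indexing code, it might help to list every offender up front; a minimal sketch, assuming the rows are plain dicts and using a JSON byte count as a proxy for Algolia's record-size measurement:

```python
import json

# Hypothetical helper: serialize each record and flag the ones above a
# byte budget, so the offending rows can be inspected before calling
# index.save_objects(rows). The 10 KB default is an assumption taken
# from Algolia's documented per-record limits; adjust to the actual plan.
def oversized_records(rows, limit_bytes=10_000):
    """Return (objectID, size_in_bytes) pairs for records over the budget."""
    flagged = []
    for row in rows:
        size = len(json.dumps(row).encode("utf-8"))
        if size > limit_bytes:
            flagged.append((row.get("objectID"), size))
    return flagged
```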

gabfr commented Oct 31, 2019

I've found a page that helps us with the issue above: https://www.algolia.com/doc/faq/basics/is-there-a-size-limit-for-my-index-records/
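Following that page's advice to shrink records rather than request a bigger quota, one option is to truncate the long free-text field before indexing. A rough sketch, assuming the bulky field is called "description" (hypothetical name) and reusing a JSON byte count as the size measure:

```python
import json

# Hypothetical sketch of the FAQ's advice: shrink each record before
# indexing by truncating its long free-text field until the serialized
# record fits under the byte budget. Returns a trimmed copy; the
# original row is left untouched.
def trim_record(row, field="description", limit_bytes=10_000):
    row = dict(row)
    while len(json.dumps(row).encode("utf-8")) > limit_bytes and row.get(field):
        row[field] = row[field][: len(row[field]) // 2]  # halve until it fits
    return row
```

For search quality it would likely be better to keep only the searchable part of the description (e.g. the first few hundred words) rather than blind truncation, but this shows the shape of the fix.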
