forked from hpcc-systems/HPCC-Platform
This file contains an outline of some of the different areas we might pursue in the future. Once the plans become
concrete they will be added as issues with associated milestones.
Technology changes
==================
What technology changes do we need to ensure we adapt to?
Many cores
* Better use of lots of threads.
* Parallel PARSE, PROJECT and other activities that are cpu intensive.
* Dynamically adapt to the number of available threads.
* Ensure multithreading code is efficient.
* Reduce critical sections and locking.
* More read ahead threads.
* Experiment with Sequential blocked read ahead.
More memory
* Dynamic resourcing in Thor.
* Dynamic caching of spill files.
SSDs
* What do the greater seek rates allow us to do differently?
Increased cost of power
Improved network speed
* The gap between network speed and disk speed is growing. What assumptions does that change?
Cloud support and Local v Remote files.
Architectural
=============
What changes to the underlying architecture do we want to make, and why?
Full Windows support.
* Solve problems with SSH under Windows.
* Solve problems of how to get/build third party libraries.
* Equivalent to the init system.
* Support 64-bit Windows.
Improvements to measurement and statistics
* Do we know where all the time is going?
- In the code generator?
- In the run time engines?
* What feedback would help ECL developers?
Combine roxie and hthor
* Extend roxie so it becomes a superset of hthor.
* May require a flag to indicate the mode (e.g., spill handling) but code base should be one.
* Allow roxie to listen on a dali queue.
Common up the row reading interfaces between roxie and thor.
* Makes it easier to pick up the system and work on it.
* Make it easier to provide utility classes (e.g., readahead, activities).
Support re-entrant global-graph execution
* Allow a query to call a C++ function which might then call another executeGraph() call.
* Opens up possibilities of more flexible code generation.
* Requires changes to code gen and engines.
* Requires parent extract supported by global graph execute.
* Thor???
Reduce the number of dlls in the system
* The number of dlls and dependencies can almost certainly be simplified.
Switch to OpenMPI or other framework
* Does it provide the capabilities we need?
* Would it be a suitable replacement for thor or roxie transport?
Generate more than one dll for each work unit?
* Allow more granular query compiling.
* Reduce the data required by on-demand roxie slaves.
* Allow remote filtering and projecting.
Extensible system
=================
What changes can we make to make it easier for third parties to extend?
What benefit might we get?
File formats
* Indexes
- Enable optimizations to our own formats.
- New implementations.
- Interfacing with external implementations.
* Files
- Compressed
- Hadoop
- Embedded resources
File locations/sources
* Rationalise the current logical filename syntax, and extend it.
Repositories
- Allow more flexibility and extensibility in the sources of files used as input to eclcc.
* Create a cleaner interface for accessing hierarchical ECL source.
* Building directly from GitHub
* Tar
* Compressed archive
C++ integration
* Make it easier to link libraries/blocks of C++ into programs.
* Improve support for C++ attributes (e.g., dependencies between attributes).
* Streaming of datasets to and from C++.
* Using third party libraries.
Activities
* Make it easier for 3rd parties to extend the activities in the system.
* Allow user C++ activities to be defined.
New capabilities
================
What capabilities can we add to the system to make it solve more problems?
Better support for UTF8
* Current support isn't even documented.
* PARSE and a few other places (indexes?) need some more work.
Unicode support
* Better support in indexes.
* Expose word break semantics and support in PARSE.
* UTF8 DFAs in PARSE.
Thor debugger.
Support recursion
New activities
* DATASET(count, transform(counter))
More problem domains
- What would be required to support some of the following domains better?
* Biological/Genetic.
* Matrix processing/computationally intensive.
* Unicode free text processing.
* Better support for SAS/R.
* What hooks can we provide to make it easy for 3rd parties to implement?
Enterprise
==========
What extensions can we make to the enterprise system?
Repository
* Fix existing repository implementation - particularly cache issues
* Simplify and improve caching capabilities.
* Fix the current directory scanning.
* Allow more repository types (see extensibility)
Legacy support tools
* Support tool to add imports, and clean up other changes that are required.
Encryption at rest
* Do we need it?
* How do we safely distribute the keys with the current system?
Redundancy
* Should we support 3 or more way redundancy?
Clean up query deployment
* Finish the query sets
Better testing
* Regression suite could do with a thorough overhaul.
* Ideally some better coverage testing.
* Some queries that can be run as a benchmark for the system speed.
* File spray tests.
Streamed input support in Thor.
* Following on from the discussion with David and Dermot.
SQL interface into the system.
- Could this build on the mapping and joining fields for the roxie browser?
Dali hot/warm failover redundancy
Optimizations
=============
How can we improve the performance of the system?
Optimize the complexity of the graphs that are run
* Code generator could track how sort orders are used and optimize the activities generated.
Dynamic resourcing
* Scope for Thor to select different implementations based on input data size.
* Dynamic row caching / combine multiple subgraphs as one.
New activity implementations
* If the data is held on a Lustre file system, is there scope for new sort activities?
Reduce data transfer
* Local and remote helpers would significantly reduce the amount of data transferred for roxie keyed joins (and other activities).
Row representation
* More intelligent row serialize/deserialize (e.g. on index read slaves)
* Enable packing/alignment on rows.
* Allow sizes to be separate from their strings/datasets.
* Datasets and strings with smaller record counts.
* Link counted child rows (not just datasets)
* Maxcount(1) optimization
* Link counted strings
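As a rough illustration of what "link counted" means for strings held in rows, the sketch below shares one copy of the character data between rows and frees it when the last link is released. `LinkedString`, `Row`, `cloneRow` and `releaseRow` are hypothetical names invented for this example; they are not the platform's row or string classes.

```cpp
#include <atomic>
#include <string>

// Hypothetical link-counted string: the character data is shared and is
// freed when the last link is released.
class LinkedString
{
public:
    static LinkedString* create(const char* text) { return new LinkedString(text); }

    void link() { links_.fetch_add(1, std::memory_order_relaxed); }

    // Drops one link; returns true if this call destroyed the string.
    bool release()
    {
        if (links_.fetch_sub(1, std::memory_order_acq_rel) == 1)
        {
            delete this;
            return true;
        }
        return false;
    }

    const std::string& str() const { return text_; }
    unsigned links() const { return links_.load(std::memory_order_relaxed); }

private:
    explicit LinkedString(const char* text) : links_(1), text_(text) {}
    ~LinkedString() = default; // only release() may destroy the object

    std::atomic<unsigned> links_;
    std::string text_;
};

// Hypothetical row that holds a link to the string rather than its own copy.
struct Row
{
    unsigned id;
    LinkedString* name;
};

// Cloning a row adds a link; no character data is copied.
inline Row cloneRow(const Row& src)
{
    src.name->link();
    return Row{src.id, src.name};
}

// Releasing a row drops its link; returns true if the string was freed.
inline bool releaseRow(Row& row)
{
    bool freed = row.name->release();
    row.name = nullptr;
    return freed;
}
```

The trade-off in this sketch is an atomic increment per clone in exchange for not copying the string, which only pays off when strings are long or rows are cloned often.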
Speed up eclcc
* For some queries (e.g., NCF) a large proportion of the development time must be spent compiling and deploying the queries.
BCD library
* Remove the critical section by using a thread variable for the stack
* Improve code for the basic operations.
* Consider switch to a non-stack implementation.
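One way to read the first bullet: if the critical section exists only to protect a single shared operand stack, giving each thread its own stack removes the need for the lock. The sketch below assumes exactly that. `Decimal`, `decimalStack`, `pushDecimal`, `popDecimal` and `addDecimal` are hypothetical names for illustration, and the fixed two-decimal-place representation is a stand-in for the real BCD format.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a decimal value (value * 100, two decimal places).
struct Decimal
{
    long long scaled;
};

// One operand stack per thread: no lock is needed because no other thread
// can ever see this vector.
inline std::vector<Decimal>& decimalStack()
{
    thread_local std::vector<Decimal> stack;
    return stack;
}

inline void pushDecimal(long long scaled)
{
    decimalStack().push_back(Decimal{scaled});
}

inline long long popDecimal()
{
    std::vector<Decimal>& stack = decimalStack();
    assert(!stack.empty()); // popping an empty stack is a caller error
    long long value = stack.back().scaled;
    stack.pop_back();
    return value;
}

// Pops two operands and pushes their sum, stack-machine style.
inline void addDecimal()
{
    long long rhs = popDecimal();
    long long lhs = popDecimal();
    pushDecimal(lhs + rhs);
}
```

A non-stack implementation, as the last bullet suggests, would instead pass operands and results explicitly, which avoids per-thread state altogether.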
Conditional actions in graphs
* Most of the work for this has been done; we should explicitly aim to support and enable it.
* Need to finish WHEN support (e.g., implicit field projection from side-effects).
Graph representation
* Compress it
* Don't include the graphs in the SDS, retrieve from the workunit instead.
Improve implicit field projection
- Currently doesn't optimize nested record structures.
Improve the common sub expression processing in eclcc.
Improve generation of conditional expressions.
Implement costing for expressions and activities
* Would improve decisions about whether reordering, substituting, etc. is worthwhile.
Variable in <set>
* Should sometimes use a hash table.
* Special case of an associative array [#50371], e.g., MAP(dataset, { keyed } [,{extra}]);
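The "sometimes" in the first bullet can be sketched as a membership test that scans small sets linearly and switches to a hash table for larger ones. `SetMembership` and `kHashThreshold` are hypothetical, and the threshold of 8 is an arbitrary illustrative value, not a measured cut-over; this is not what the code generator currently emits.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical helper for "value IN <set>" tests.
class SetMembership
{
public:
    // Arbitrary illustrative cut-over between linear scan and hashing.
    static constexpr std::size_t kHashThreshold = 8;

    explicit SetMembership(std::vector<std::string> values)
        : values_(std::move(values))
    {
        // Only pay the cost of building the hash table for larger sets.
        if (values_.size() > kHashThreshold)
            hashed_.insert(values_.begin(), values_.end());
    }

    bool usesHash() const { return !hashed_.empty(); }

    bool contains(const std::string& candidate) const
    {
        if (usesHash())
            return hashed_.count(candidate) != 0;
        // Small sets: a linear scan avoids hashing the candidate at all.
        return std::find(values_.begin(), values_.end(), candidate) != values_.end();
    }

private:
    std::vector<std::string> values_;
    std::unordered_set<std::string> hashed_;
};
```

A costing model, as proposed elsewhere in this file, would be one way to pick the threshold rather than hard-coding it.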
Fix resourcing of inline datasets
* Currently the CSE for datasets executed within a transform is poor.
Optimize overly conditional code
* Often occurs when converting procedural code to ECL. Too many guard conditions are added to the ECL.