Community Bonding Period
I made two patches for the OpenCL (OCL) shot code. One patch refactors the existing SPH (Sphere) shot code, and the other implements the EHY (Elliptical Hyperboloid) shot code.
|M0.1||fix OCL SPH shot routine compilation errors.||#341||TRUNK|
|M0.2||EHY shot routine in OCL.||#346||TRUNK|
|M1||ELL and ARB8 shot routines in OCL.||#370||TRUNK|
|M2||refactor dispatcher, shoot, optical renderer to process many rays in parallel in C when rendering an image or block.||BRANCH|
|M3||grid spatial partitioning in OCL.||#379||DONE|
|M4||GPU side database storage of OCL implemented primitives SPH, EHY, ELL, ARB8.||#392||IN PROGRESS|
|M5||port compute intensive or critical parts of the dispatcher, boolean evaluation, optical renderer to OCL.||?|
|M6||TOR and TGC shot routines in OCL.||#393||TRUNK|
|M7||BOT shot routine in OCL.||CANCELLED|
The ARB8, EHY, ELL, SPH, TOR, and TGC shot routines are in SVN trunk.
Week 1 : 25-31 May
- Created some example .g files in mged for the primitives to be implemented this week. The Quick Reference Card proved to be quite useful.
- Did the matrix ops for EHY (Elliptical Hyperboloid) on the OCL side.
- Made patch for ELL (Generalized Ellipsoid) and ARB8 (Arbitrary Polyhedron) OCL shots.
- M1 complete: ELL, ARB8 shot routines in OCL.
- Tried out a bunch of code browsing tools (cscope, LXR, doxygen, etc.). The NetBeans IDE seems the most promising.
Week 2 : 1-7 Jun
- Read code to better understand the main rendering loop. It seems to be something like this:
do_frame() → do_run() → worker()* → do_pixel()* → rt_shootray()* → rt_*_shot()
- The code is recursive (which is problematic for OCL). As a first approach I'll work on a simplified version of the rendering loop, in C, which only does the primary rays. Once I have non-recursive, parallel-friendly C code I'll work on the OCL port.
- Updated project proposal on Google Melange.
- SVN r65153 fails to compile with a bogus unused-variable error. The variable is actually being used; it's just that GCC 4.9.1 is too dumb to figure that out.
- Upgraded Ubuntu and GCC.
- Made simple ray generation code in C.
- Made simple frame buffer write code in C.
- Made simple diffuse shading code in C.
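The three simple pieces listed above (ray generation, diffuse shading, frame buffer write) could look roughly like the sketch below. It is only an illustration under stated assumptions, not the code from the patches: the types `vec3` and `ray` and the functions `gen_ray`, `diffuse`, and `fb_write` are hypothetical names, the projection is assumed to be orthographic with rays fired along -Z, and the shading vectors are assumed to be already normalized.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; these are not BRL-CAD's structs. */
typedef struct { double x, y, z; } vec3;
typedef struct { vec3 origin, dir; } ray;

static double dot3(vec3 a, vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

/* Orthographic primary ray through the centre of pixel (px, py).
 * All rays share the direction -Z and start on a view plane of size
 * view_w x view_h placed at z = z_start (an assumed camera setup). */
ray gen_ray(int px, int py, int width, int height,
            double view_w, double view_h, double z_start)
{
    ray r;
    assert(width > 0 && height > 0);
    r.origin.x = ((px + 0.5) / width  - 0.5) * view_w;
    r.origin.y = ((py + 0.5) / height - 0.5) * view_h;
    r.origin.z = z_start;
    r.dir.x = 0.0;
    r.dir.y = 0.0;
    r.dir.z = -1.0;
    return r;
}

/* Lambertian term: cosine of the angle between the unit surface normal
 * and the unit direction towards the light, clamped at zero. */
double diffuse(vec3 normal, vec3 to_light)
{
    double c = dot3(normal, to_light);
    return c > 0.0 ? c : 0.0;
}

/* Write one grey pixel into a packed 8-bit RGB frame buffer.
 * The intensity is clamped to [0, 1] before conversion. */
void fb_write(unsigned char *fb, int width, int px, int py, double intensity)
{
    size_t i = 3 * ((size_t)py * (size_t)width + (size_t)px);
    unsigned char v;
    if (intensity < 0.0) intensity = 0.0;
    if (intensity > 1.0) intensity = 1.0;
    v = (unsigned char)(intensity * 255.0 + 0.5);
    fb[i] = fb[i + 1] = fb[i + 2] = v;
}
```

Because every pixel is computed independently and nothing recurses, a loop over `gen_ray`, a shot routine, `diffuse`, and `fb_write` is the kind of structure that maps onto one OCL work-item per pixel.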
Week 3 : 8-14 Jun
- Added the main boolean weaving code to our minimal renderer.
- Eliminated some gotos and made the code more thread safe.
- The simple renderer patches are on the mailing list.
Week 4 : 15-21 Jun
- Added OpenMP compile support. Used OpenMP constructs to launch the rendering threads. This work still has some bugs in it.
- The alpha M2 patch is on the mailing list.
- Read code to better understand the main spatial partition construction routines. They seem to be something like this:
rt_prep_parallel() → rt_cut_it() → rt_nugrid_cut()
- We need something less complex that is more amenable to porting to OCL. So I will be implementing the Lagae & Dutré compact grid construction algorithm published at EGSR. I will first write it in ANSI C and then port the code to OpenCL.
- Started work on M3: grid spatial partitioning in OCL.
- ANSI C Lagae & Dutré grid construction code.
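A minimal sketch of a compact grid build in the spirit of the Lagae & Dutré scheme is shown below: count the object references per cell, prefix-sum the counts, then fill one shared index array. This is my own illustration, not the code from the patch. The names `aabb`, `compact_grid`, `cell_of`, `visit`, and `grid_build` are hypothetical, objects are assumed to be represented by axis-aligned bounding boxes, and the grid resolution is assumed to be chosen by the caller.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { double min[3], max[3]; } aabb;   /* hypothetical object bounds */

typedef struct {
    int res[3];       /* number of cells per axis */
    double min[3];    /* lower corner of the grid */
    double cell[3];   /* cell size per axis */
    int *offsets;     /* ncells + 1 entries: cell c owns items[offsets[c] .. offsets[c+1]) */
    int *items;       /* object indices, grouped by cell */
} compact_grid;

/* Cell coordinate of world coordinate p along axis a, clamped into the grid. */
static int cell_of(const compact_grid *g, double p, int a)
{
    double t = (p - g->min[a]) / g->cell[a];
    if (t <= 0.0) return 0;
    if (t >= (double)g->res[a]) return g->res[a] - 1;
    return (int)t;
}

/* Visit every cell overlapped by box b.
 * pass 0: count one reference per cell; pass 1: store the object index. */
static void visit(compact_grid *g, const aabb *b, int obj, int pass)
{
    int lo[3], hi[3], a, x, y, z;
    for (a = 0; a < 3; a++) {
        lo[a] = cell_of(g, b->min[a], a);
        hi[a] = cell_of(g, b->max[a], a);
    }
    for (z = lo[2]; z <= hi[2]; z++)
        for (y = lo[1]; y <= hi[1]; y++)
            for (x = lo[0]; x <= hi[0]; x++) {
                int c = (z * g->res[1] + y) * g->res[0] + x;
                if (pass == 0) g->offsets[c]++;
                else g->items[--g->offsets[c]] = obj;
            }
}

/* Build the grid; res, min and cell must be set by the caller.
 * Returns 0 on success, -1 on allocation failure. */
int grid_build(compact_grid *g, const aabb *boxes, int n)
{
    int ncells = g->res[0] * g->res[1] * g->res[2];
    int i, total;
    assert(ncells > 0 && n >= 0);

    g->offsets = calloc((size_t)ncells + 1, sizeof(int));
    if (!g->offsets) return -1;

    /* 1. count the references in each cell */
    for (i = 0; i < n; i++) visit(g, &boxes[i], i, 0);

    /* 2. inclusive prefix sum: offsets[c] becomes one past the end of cell c */
    for (i = 1; i <= ncells; i++) g->offsets[i] += g->offsets[i - 1];
    total = g->offsets[ncells];

    g->items = malloc(sizeof(int) * (size_t)(total > 0 ? total : 1));
    if (!g->items) { free(g->offsets); g->offsets = NULL; return -1; }

    /* 3. fill backwards; each offset ends up at the start of its cell */
    for (i = n - 1; i >= 0; i--) visit(g, &boxes[i], i, 1);
    return 0;
}
```

The appeal for a GPU port is that the result is just two flat integer arrays, and the three steps (count, scan, fill) each parallelize over objects or cells.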
Week 5 : 22-28 Jun
- Took time off from the project to go to the CGI'15 conference.
Week 6 : 29 Jun-5 Jul
- GSoC Midterm Evaluations.
Weeks 7-8 : 6 Jul-12 Jul, 13 Jul-19 Jul
- Evaluating algorithms for grid construction.
- Selecting OCL kernels we can use to support grid construction. It seems PyOpenCL has some kernels we could use. Now the question is how to extricate the OpenCL/C from the Python...
- ANSI C grid traversal code.
- OCL grid construction on the GPU.
- M3 complete: grid spatial partitioning in OCL.
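For reference, grid traversal can be sketched as an incremental walk that always steps across the nearest cell boundary. The log does not say which traversal scheme the ANSI C code uses, so the DDA-style walk below (in the manner of Amanatides and Woo) is an assumption, and `grid_walk` with its parameters is a hypothetical name. The sketch also assumes the ray origin already lies inside the grid.

```c
#include <assert.h>
#include <float.h>

/* Walk a ray through a uniform grid and record the linear index of every
 * cell visited, in order. Assumes org lies inside the grid.
 * Returns the number of indices written to out (at most max_out). */
int grid_walk(const int res[3], const double gmin[3], const double csize[3],
              const double org[3], const double dir[3], int *out, int max_out)
{
    int cell[3], step[3], a, n = 0;
    double t_next[3], t_delta[3];

    for (a = 0; a < 3; a++) {
        double rel = (org[a] - gmin[a]) / csize[a];
        assert(rel >= 0.0 && rel < (double)res[a]);
        cell[a] = (int)rel;
        if (dir[a] > 0.0) {
            step[a] = 1;
            t_delta[a] = csize[a] / dir[a];
            t_next[a] = (gmin[a] + (cell[a] + 1) * csize[a] - org[a]) / dir[a];
        } else if (dir[a] < 0.0) {
            step[a] = -1;
            t_delta[a] = -csize[a] / dir[a];
            t_next[a] = (gmin[a] + cell[a] * csize[a] - org[a]) / dir[a];
        } else {
            /* ray is parallel to this axis: never cross its boundaries */
            step[a] = 0;
            t_delta[a] = DBL_MAX;
            t_next[a] = DBL_MAX;
        }
    }

    while (n < max_out) {
        out[n++] = (cell[2] * res[1] + cell[1]) * res[0] + cell[0];

        /* pick the axis whose next cell boundary is nearest along the ray */
        a = 0;
        if (t_next[1] < t_next[a]) a = 1;
        if (t_next[2] < t_next[a]) a = 2;
        if (step[a] == 0) break;                 /* zero direction vector */

        cell[a] += step[a];
        if (cell[a] < 0 || cell[a] >= res[a]) break;   /* left the grid */
        t_next[a] += t_delta[a];
    }
    return n;
}
```

In a renderer the body of the loop would intersect the ray with the primitives referenced by the current cell instead of recording the index, and stop early once a hit inside that cell is found.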
Weeks 9-10 : 20 Jul-26 Jul, 27 Jul-2 Aug
- Implemented GPU side solid database storage infrastructure.
- The OCL EHY shot code now uses the GPU solid database instead of creating the input buffers on every call.
- The code allows each primitive to decide how it is stored without imposing a convention. So one can use structure of arrays (SoA), array of structures (AoS), or whatever layout suits the data.
- Implemented GPU side solid database storage for SPH, ELL, ARB8.
- Extracted out some duplicated OCL code.
- M4 complete: GPU side database storage of OCL implemented primitives.
- OCL TOR (Torus) shot routine. Includes the higher order equation solver code.
- OCL TGC (Truncated General Cone) shot routine.
- Put equation solver in separate .cl file.
- M6 complete: TOR and TGC shot routines in OCL.
- General overhaul and cleanup of the OCL shot patches.
- Upgraded NVIDIA OpenCL drivers on my computer to CUDA 7.0.
- I drew up a tentative algorithm for the CSG raytrace:
# GPU execution
GEN_RAYS(args)
lengths = COMPUTE_LEN_SEGMENTS(rays, db)
segs = ALLOC_SEGMENTS(lengths)
segs = COMPUTE_SEGMENTS(rays, db, segs)
# CPU execution
waiting_segs = READ_SEGMENTS(segs)
# merge with CPU computed segments for non-accelerated primitives
finished_segs = RT_BOOL_WEAVE(waiting_segs)
partitions = RT_BOOL_FINAL(finished_segs)
pixels = VIEWSHADE(rays, db, partitions)
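The GPU half of this outline is a count, allocate, fill pattern: since a kernel cannot grow a buffer, one pass counts the segments per ray, the counts are turned into offsets, and a second pass writes the segments into a buffer of exactly the right size. The sketch below shows that pattern serially in C. It is an illustration only: `zray`, `box`, `seg`, `hits`, and `compute_segments` are hypothetical names, and axis-aligned boxes hit by rays travelling along +Z from z = 0 stand in for the real primitives and shot routines.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { double x, y; } zray;              /* ray along +Z through (x, y), from z = 0 */
typedef struct { double min[3], max[3]; } box;    /* stand-in primitive */
typedef struct { double t_in, t_out; int prim; } seg;

/* Stand-in shot test: does the ray pass through the box footprint? */
static int hits(const zray *r, const box *b)
{
    return r->x >= b->min[0] && r->x <= b->max[0] &&
           r->y >= b->min[1] && r->y <= b->max[1];
}

/* Returns a malloc'ed segment array (caller frees), or NULL on failure.
 * offsets must have nrays + 1 entries; on return the segments of ray i
 * are segs[offsets[i] .. offsets[i+1]). */
seg *compute_segments(const zray *rays, int nrays,
                      const box *db, int nprims, int *offsets)
{
    int i, j, total = 0;
    seg *segs;

    assert(nrays >= 0 && nprims >= 0);

    /* pass 1: count segments per ray (COMPUTE_LEN_SEGMENTS), fused here
     * with the exclusive scan that turns counts into start offsets */
    for (i = 0; i < nrays; i++) {
        offsets[i] = total;
        for (j = 0; j < nprims; j++)
            total += hits(&rays[i], &db[j]);
    }
    offsets[nrays] = total;

    /* ALLOC_SEGMENTS: one buffer of exactly the required size */
    segs = malloc(sizeof(seg) * (size_t)(total > 0 ? total : 1));
    if (!segs) return NULL;

    /* pass 2: recompute the hits and store them (COMPUTE_SEGMENTS) */
    for (i = 0; i < nrays; i++) {
        int k = offsets[i];
        for (j = 0; j < nprims; j++) {
            if (!hits(&rays[i], &db[j])) continue;
            segs[k].t_in = db[j].min[2];
            segs[k].t_out = db[j].max[2];
            segs[k].prim = j;
            k++;
        }
    }
    return segs;
}
```

On the GPU each iteration of the outer loops would be an independent work-item and the scan would be its own kernel. The price of the pattern is that every ray-primitive test runs twice.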
- Doing the weaving and shading on the CPU ultimately lets us reuse the CPU code for boolean weaving, primitive normals, and shaders, and so get a 100% pixel accurate result. This comes at the expense of a lot of memory traffic and CPU-side computation of some fairly maths intensive parts, like the normal computation and shading. However, I presently see no other way of having a 100% accurate result in the time we have available.
- Made OCL EHY shot code look exactly like the ANSI C version. Cleanups.
- Ran tests on OCL shot code to check for accuracy vs existing code:
# e.g. test for tgc
make tgc tgc
e tgc ; rt -o rt_tgc.pix
# OR
e tgc ; rt -o cl_tgc.pix
pixdiff rt_tgc.pix cl_tgc.pix | pix-fb
# test results:
arb8: pixdiff bytes: 777500 matching, 8932 off by 1, 0 off by many
ehy: pixdiff bytes: 760977 matching, 25443 off by 1, 12 off by many
ell: pixdiff bytes: 764588 matching, 21844 off by 1, 0 off by many
sph: pixdiff bytes: 736942 matching, 49490 off by 1, 0 off by many
tgc: pixdiff bytes: 783191 matching, 3241 off by 1, 0 off by many
tor: pixdiff bytes: 774138 matching, 12294 off by 1, 0 off by many
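Each result line adds up to 786432 bytes, which is consistent with a 512x512 image at 3 bytes per pixel. To make the three categories concrete, the sketch below classifies byte pairs the way the report reads: equal, differing by exactly 1, or differing by more. It is an assumption about the meaning of the categories and not pixdiff's actual source; `diff_stats` and `byte_diff` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t matching, off_by_1, off_by_many; } diff_stats;

/* Compare two byte buffers of length n and bin each byte pair by the
 * absolute difference of its values: 0, exactly 1, or more than 1. */
diff_stats byte_diff(const unsigned char *a, const unsigned char *b, size_t n)
{
    diff_stats s = {0, 0, 0};
    size_t i;

    assert(n == 0 || (a != NULL && b != NULL));
    for (i = 0; i < n; i++) {
        int d = (int)a[i] - (int)b[i];
        if (d < 0) d = -d;
        if (d == 0) s.matching++;
        else if (d == 1) s.off_by_1++;
        else s.off_by_many++;
    }
    return s;
}
```

Read this way, an "off by 1" byte is a one-step difference in a single 8-bit colour channel, so only the 12 "off by many" bytes in the EHY image point at a difference worth investigating.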
|RT (EHY)||OCL (EHY)||PIXDIFF (EHY)|
- The off-by-many problem with EHY is probably related to rounding errors in sqrt, since OCL on NVIDIA uses a different rounding mode (RTE) than x86 (RTP). I tried to solve it with inline PTX assembly:
asm("sqrt.rp.f64 %0, %1;" : "=r"(disc) : "r"(disc));
- No dice: the code won't run. OCL 1.1 and later have no support for setting rounding modes without using inline assembly.
Week 11 : 3 Aug-9 Aug
- Sean applied patches #341 and #346 to SVN trunk.
- Got SVN write access.
- Applied patches #393 and #370 to SVN trunk.
- SVN trunk now has OCL shot evaluation for SPH, EHY, ELL, ARB8, TOR, TGC primitives.
- Refactored SPH (remove duplicate code, etc) and applied it to trunk.
- Move declarations to top level in order to eliminate duplicate code in trunk.
- Pass struct with primitive data to OCL as an initial step to an AoS device primitive database. Move constants into common.cl.
- Generic OCL solid shot handler. Refactored code to remove duplicates.
- Load large OCL vectors on demand to reduce stack footprint per function call.
- Fix memory leak of the loaded OCL program source code.
Week 12 : 10 Aug-16 Aug
- Add inclusive and exclusive scan OCL code from PyOpenCL to trunk.
- Created a private branch for opencl.
- M2 committed to opencl branch: kludge up a simple rendering pipeline with grid spatial partitioning traversal acceleration.
- The simple ANSI C rendering pipeline only supports Lambertian reflection with a stock grey material to make things simpler. Example output for goliath.g:
- For future reference I get these timings for the above scene (one OCL kernel invocation per ray-primitive shot):
- SHOT: cpu = 421.568 sec, elapsed = 447.675 sec
- M4 committed to opencl branch: add device side solid database storage.