 
|M1||ELL and ARB8 shot routines in OCL.||#370||'''TRUNK'''
|-
|M2||<s>refactor dispatcher, shoot, optical renderer to process many rays in parallel in C when rendering an image or block.</s>||||''see M5''
|-
|M3.0||<s>grid spatial partitioning in C.</s>||#379||''see M3.2''
|-
|M3.1||<s>grid spatial partitioning in OCL.</s>||#379||''see M3.2''
|-
|M3.2||HLBVH object partitioning builder in C. traversal in OCL.||||'''TRUNK'''
|-
 
|M4||GPU side database storage of OCL implemented primitives.||#392||'''TRUNK'''
 
|-
 
|M5||port compute intensive or critical parts of the dispatcher, <s>boolean evaluation</s>, optical renderer to OCL.|| ||'''TRUNK'''
 
|-
 
|M5.1||OCL dispatcher that performs the shot routines for a whole frame.||||'''TRUNK'''
 
|-
 
|M5.2||OCL rasterizer that does the pixel pushing for a whole frame.||||'''TRUNK'''
 
|-
 
|M5.3||OCL lighting modes: Phong, Diffuse, Surface Normals.||||'''TRUNK'''
 
|-
 
|M5.4||OCL lighting modes: Multi-hit transparent.||||'''TRUNK'''
 
 
|-
|M6||TOR and TGC shot routines in OCL.||#393||'''TRUNK'''
|-
|M6.1||REC shot routine in OCL.||||'''TRUNK'''
|-
 
|M6.2||Surface normal routines for all seven OCL implemented primitives.||||'''TRUNK'''
 
|-
 
|M7||BOT shot routine in OCL.||||-
 
|-
 
|M7.1||Simple BOT shot routine in OCL that computes triangle hits and normals brute force.||||'''TRUNK'''
 
|-
 
|M7.2||CPU HLBVH BOT shot construction with OCL traversal and interpolated per pixel normals.||||'''TRUNK'''
 
 
|}

<!--

-->
  
The ARB8, ARS, BOT, EHY, ELL, SPH, REC, TOR, and TGC shot routines are in SVN trunk.
 
 
SVN trunk also contains solid database device storage and a render function which, given a view2model matrix, width, and height, can generate an RGB8 bitmap. Diffuse and Surface Normal light models are supported. The renderer does BVH accelerated ray tracing and ignores the CSG operators. It is integrated as a render option in '''mged'''.
 
  
 
=Development Phase=
 
|}
  
* The off by many problem with EHY is probably related to rounding errors with ''sqrt'': OCL on NVIDIA seems to use a different rounding mode than X86. It is possible to use PTX assembly, e.g. <code>asm("sqrt.rp.f64 %0, %1;" : "=d"(b) : "d"(a));</code>. OCL 1.1 and later have [http://wok.oblomov.eu/tecnologia/gpgpu/opencl-rounding-modes/ no support for setting rounding modes] without using inline assembly.
  
 
===Week 11 : 3 Aug-9 Aug===
 
  
 
* ''M2 committed to opencl branch: kludge up a simple rendering pipeline with grid spatial partitioning traversal acceleration.''
: The simple ANSI C rendering pipeline only supports Lambertian reflection with a stock grey material to make things simpler. Goliath scene:
 
<blockquote>
{|
|}
</blockquote>
:This is not an apples-to-apples comparison since the work done is a lot different. The brute force version ignores the CSG operators and we don't have OCL normal computation, but it's a way to gauge the possibilities here. If we implemented a BVH we could cut the iterations per pixel from 2429 to around log2(2429)*depth complexity, where log2(2429)=11.25. This is in the opencl branch.
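The BVH estimate above can be reproduced with a few lines of C. This is only a back-of-the-envelope sketch: it assumes a perfectly balanced binary tree with one solid per leaf, and <code>bvh_levels</code> and <code>bvh_tests_per_pixel</code> are hypothetical helper names, not BRL-CAD functions.

```c
#include <assert.h>
#include <stddef.h>

/* Levels a ray descends in a perfectly balanced binary tree with n
 * leaves, i.e. ceil(log2(n)), computed without floating point. */
static unsigned bvh_levels(size_t n)
{
    unsigned levels = 0;
    size_t capacity = 1;   /* leaves reachable with `levels` levels */

    assert(n > 0);
    while (capacity < n) {
        capacity *= 2;
        levels++;
    }
    return levels;
}

/* Rough node visits per pixel: one descent per surface the ray crosses
 * (depth complexity). Real trees are not perfectly balanced. */
static size_t bvh_tests_per_pixel(size_t nsolids, size_t depth_complexity)
{
    return (size_t)bvh_levels(nsolids) * depth_complexity;
}
```

For 2429 solids the integer version rounds log2(2429)=11.25 up to 12 levels, so a ray crossing four surfaces would visit on the order of 48 nodes instead of testing all 2429 solids.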
 
 
:PS: The missile launcher tubes don't show up. The tgcs degenerate to recs, so an OCL rec shot routine is needed to render this properly. This might also be an issue in other scenes.
 
 
 
* Add REC shot routine to opencl branch. Fixes the issues with the havoc missile launchers as seen above.
 
 
 
* Backport REC shot routine from opencl branch to trunk. Add checks for NULL results buffer.
 
 
 
* OCL normal computation for arb8, ehy, ell, rec, sph, tgc, tor.
 
* Backport opencl normal computation for arb8, ehy, ell, rec, sph, tgc, tor to trunk.
 
 
 
* Add diffuse, surface normals rendering light models to opencl branch. Screenshots:
 
<blockquote>
 
{|
 
!'''OCL Sphere (Surface Normals)'''!!'''OCL Sphere (Diffuse)'''!!'''OCL Havoc (Surface Normals)'''
 
|-
 
|[[File:Cl sphere surf normals.png|512px]]||[[File:Cl sphere diffuse.png|512px]]||[[File:Cl havoc surf normals.png|512px]]
 
|-
 
|align="center"|elapsed time: 0.05 sec||align="center"|elapsed time: 0.05 sec||align="center"|elapsed time: 4.20 sec
 
|}
 
</blockquote>
 
 
 
* Backport diffuse, surface normals rendering light models from opencl branch.
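The two light models can be illustrated with a small C sketch. It assumes unit-length vectors, a single grey albedo, and the common convention of remapping normal components from [-1,1] to [0,1]; the exact mapping and sign conventions used by '''rt''' may differ. All names here (<code>shade_diffuse</code>, <code>shade_normal</code>, <code>to_u8</code>) are hypothetical.

```c
#include <assert.h>

typedef struct { double x, y, z; } vec3;

static double dot3(vec3 a, vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Convert an intensity in [0,1] to an 8-bit channel, clamping and rounding. */
static unsigned char to_u8(double v)
{
    if (v < 0.0) v = 0.0;
    if (v > 1.0) v = 1.0;
    return (unsigned char)(v * 255.0 + 0.5);
}

/* Diffuse (Lambertian): brightness is the cosine between the unit normal
 * and the unit direction toward the light, clamped at zero. */
static void shade_diffuse(vec3 n, vec3 to_light, double albedo,
                          unsigned char rgb[3])
{
    double c = dot3(n, to_light);

    assert(albedo >= 0.0 && albedo <= 1.0);
    if (c < 0.0) c = 0.0;
    rgb[0] = rgb[1] = rgb[2] = to_u8(albedo * c);
}

/* Surface normals: remap each component from [-1,1] to [0,1]. */
static void shade_normal(vec3 n, unsigned char rgb[3])
{
    rgb[0] = to_u8(n.x * 0.5 + 0.5);
    rgb[1] = to_u8(n.y * 0.5 + 0.5);
    rgb[2] = to_u8(n.z * 0.5 + 0.5);
}
```

Both functions write one RGB8 pixel, which matches the output format of the render function described earlier.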
 
 
 
* In hindsight I think the grids are not a good option for BRL-CAD on the GPU. Spatial partitioning can result in duplicate shots, and shots of BRL-CAD primitives can be more expensive than simple ray-triangle shots. The alternative is to use mailboxing like the BRL-CAD ANSI C code currently does, but this requires a lot of per thread memory which we can ill afford on a massively parallel architecture like a GPU. So I think we would be better served by object partitioning, namely BVHs. Did another literature search to see if I could come up with some papers we could use for the boolean evaluation, BVH construction and traversal.
 
 
 
:As for the boolean evaluator, if we can compute this incrementally it will have a significant impact on memory loads and memory consumption.
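To make the memory cost of the mailboxing mentioned above concrete, here is a generic sketch in C. It is not BRL-CAD's implementation: the <code>mailbox</code> type and its functions are hypothetical, and it uses one word per primitive where a real implementation might use a bit vector.

```c
#include <assert.h>
#include <stdlib.h>

/* One mailbox per ray in flight: remembers which ray last tested each
 * primitive so a primitive overlapping several grid cells is shot once. */
typedef struct {
    unsigned *last_ray;   /* last_ray[i] = id of last ray that tested i */
    size_t nprims;
} mailbox;

static int mailbox_init(mailbox *mb, size_t nprims)
{
    mb->last_ray = calloc(nprims, sizeof *mb->last_ray);  /* 0 = never */
    mb->nprims = nprims;
    return mb->last_ray != NULL;
}

/* Returns 1 if `prim` still has to be intersected for `ray_id` (ids start
 * at 1) and records the test; returns 0 if it was already tested. */
static int mailbox_claim(mailbox *mb, size_t prim, unsigned ray_id)
{
    assert(prim < mb->nprims && ray_id != 0);
    if (mb->last_ray[prim] == ray_id)
        return 0;
    mb->last_ray[prim] = ray_id;
    return 1;
}

static void mailbox_free(mailbox *mb)
{
    free(mb->last_ray);
    mb->last_ray = NULL;
    mb->nprims = 0;
}

/* Memory needed when each of `nthreads` concurrent rays owns a mailbox. */
static size_t mailbox_bytes(size_t nprims, size_t nthreads)
{
    return nprims * nthreads * sizeof(unsigned);
}
```

The last function shows the problem: storage grows with primitives times rays in flight, which is cheap for a handful of CPU threads but not for the thousands of work-items a GPU keeps active.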
 
 
 
* Retrofit HLBVH tree builder from pbrtv3 source into opencl branch.
 
* OCL BVH traversal in branch.
 
: For reference the OCL BVH can render the Havoc scene, as seen above, in '''0.09 sec''' vs the 4.20 sec it took with the brute force code, i.e. it is around 45x faster for this scene. The advantage should increase for scenes with more solids.
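The speedup comes from skipping whole subtrees whose bounding box the ray misses. A minimal stack-based traversal in plain C might look like the following. The flattened node layout (first child stored right after its parent, as in pbrt-style linear BVHs) and all names are assumptions for illustration; the actual OCL kernel may be organized differently.

```c
#include <assert.h>

typedef struct { double min[3], max[3]; } aabb;

typedef struct {
    aabb box;
    int second_child;   /* interior: index of 2nd child (1st is at i+1) */
    int prim;           /* leaf: primitive index; interior: -1 */
} bvh_node;

/* Slab test: does o + t*dir enter the box for some t in [0, tmax]?
 * inv_dir holds 1/dir per axis. */
static int ray_hits_box(const aabb *b, const double o[3],
                        const double inv_dir[3], double tmax)
{
    double t0 = 0.0, t1 = tmax;
    int a;

    for (a = 0; a < 3; a++) {
        double tn = (b->min[a] - o[a]) * inv_dir[a];
        double tf = (b->max[a] - o[a]) * inv_dir[a];
        if (tn > tf) { double tmp = tn; tn = tf; tf = tmp; }
        if (tn > t0) t0 = tn;
        if (tf < t1) t1 = tf;
        if (t0 > t1) return 0;
    }
    return 1;
}

/* Collect primitives whose leaf boxes the ray crosses; returns the count.
 * Only these candidates need a full shot routine call. */
static int bvh_candidates(const bvh_node *nodes, const double o[3],
                          const double inv_dir[3], double tmax,
                          int *out, int max_out)
{
    int stack[64];
    int sp = 0, n = 0;

    stack[sp++] = 0;                       /* start at the root */
    while (sp > 0) {
        int i = stack[--sp];
        const bvh_node *nd = &nodes[i];

        if (!ray_hits_box(&nd->box, o, inv_dir, tmax))
            continue;                      /* prune the whole subtree */
        if (nd->prim >= 0) {
            if (n < max_out) out[n++] = nd->prim;
        } else {
            assert(sp + 2 <= 64);
            stack[sp++] = nd->second_child;
            stack[sp++] = i + 1;           /* first child is visited next */
        }
    }
    return n;
}
```

The fixed-size stack is the kind of small per-ray state that fits in GPU private memory, in contrast to a per-ray mailbox over all primitives.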
 
 
 
* The HLBVH code has stabilized enough that I replaced the grids code with it.
 
 
 
* Added bu_pool memory pool allocator because it's useful for the HLBVH builder.
 
* HLBVH CPU builder code has landed on trunk.
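A tree builder allocates many small nodes and frees them all at once, which is the pattern a pool allocator serves well. The sketch below is a generic bump allocator, not the <code>bu_pool</code> API; the <code>pool</code> type, its functions and the 16-byte alignment are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    unsigned char *base;   /* one big block obtained up front */
    size_t size;           /* capacity in bytes */
    size_t used;           /* bytes handed out so far */
} pool;

static int pool_init(pool *p, size_t size)
{
    assert(size > 0);
    p->base = malloc(size);
    p->size = size;
    p->used = 0;
    return p->base != NULL;
}

/* Hand out n bytes at the next 16-byte aligned offset, or NULL if the
 * pool is exhausted. There is no per-allocation free. */
static void *pool_alloc(pool *p, size_t n)
{
    size_t start = (p->used + 15u) & ~(size_t)15u;

    if (start > p->size || n > p->size - start)
        return NULL;
    p->used = start + n;
    return p->base + start;
}

/* Release every allocation in one call. */
static void pool_free(pool *p)
{
    free(p->base);
    p->base = NULL;
    p->size = p->used = 0;
}
```

Allocation is a pointer bump and teardown is a single <code>free</code>, so building and discarding a tree with hundreds of thousands of nodes avoids per-node allocator overhead.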
 
 
 
* Integrated OCL rendering with '''rt''' (command line option "-z") and '''mged''' (''diffuse'' and ''surface normals'' light models). Currently it is fill rate limited. Pixel pushing is done with '''view_pixel''' on a single CPU core. This should be done on the GPU. The framebuffer outputs should also support writing blocks of pixels. Currently they use line oriented output. Doing these changes would require breaking API compatibility.
 
 
 
===Week 13 : 17 Aug-23 Aug===
 
* Do heavy duty pixel pushing with the GPU. This speeds up rendering of Havoc around 2-3x on my system. It should make even more of a difference in simpler scenes, which are limited more by fill rate than by geometry performance. I figured out a way to write the code for this without actually breaking the API: I used a callback to get the framebuffer pointer.
 
 
 
* I redid the accuracy tests after reimplementing the raster parts of the code in OCL. I got the same accuracy in surface normals mode as when we only computed the hit results in OCL with one kernel invocation per ray-solid intersection.
 
 
 
<blockquote>
 
{|
 
!'''RT Hyperboloid'''!!'''OCL Hyperboloid'''!!'''PIXDIFF Hyperboloid'''
 
|-
 
|[[File:Rt_ehyn.png|256px]]||[[File:Cl_ehyn.png|256px]]||[[File:Diff_ehyn.png|256px]]
 
|-
 
|align="center"|elapsed time @ 972x956: 0.35 sec||align="center"|elapsed time @ 972x956: 0.06 sec||
 
|}
 
</blockquote>
 
 
 
:This was the primitive which had the most differences last time so I ran the test again. <code>ehy: pixdiff bytes:  760757 matching,  25663 off by 1,      12 off by many</code>. I got similar results. So the pixel engine shouldn't be more inaccurate than the regular one. What I did find out in surface normals mode was that the CPU code is actually showing hits with the side of the hyperboloid (see the blue dots in the figure at the left), despite this view being top down. So maybe the GPU version is actually ''more'' accurate? The differences show a nice noisy pattern without obvious banding or moiré, so there don't seem to be any major issues with the hits, normals, and raster.
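The three counts in the quoted output can be understood as a per-byte classification of two images. The following C sketch mimics that classification; it is not the source of the <code>pixdiff</code> tool, and <code>count_byte_diffs</code> is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t matching;      /* bytes that are identical */
    size_t off_by_1;      /* bytes that differ by exactly one level */
    size_t off_by_many;   /* bytes that differ by two or more levels */
} diff_counts;

/* Compare two raw image buffers of n bytes each, channel by channel. */
static diff_counts count_byte_diffs(const unsigned char *a,
                                    const unsigned char *b, size_t n)
{
    diff_counts c = {0, 0, 0};
    size_t i;

    assert(a != NULL && b != NULL);
    for (i = 0; i < n; i++) {
        int d = (int)a[i] - (int)b[i];
        if (d < 0) d = -d;
        if (d == 0)      c.matching++;
        else if (d == 1) c.off_by_1++;
        else             c.off_by_many++;
    }
    return c;
}
```

As a sanity check on the numbers above, 760757 + 25663 + 12 = 786432 bytes, which is what a 512x512 image with three bytes per pixel would contain.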
 
 
 
* Show <code>-z</code> OpenCL command line option when running <code>rt -h</code>.
 
 
 
* Rename table.cl to rt.cl.
 
* Replace branches in pixel writing with conditional moves.
 
* Refactor sub buffer code.
 
* Write depth buffer in network byte order.
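Network byte order means most significant byte first, independent of the host CPU. A portable way to do this for one depth value is sketched below, assuming depths are stored as IEEE 754 doubles (the log does not state the element type). <code>put_double_be</code> and <code>get_double_be</code> are hypothetical helpers, not libbu functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a double as 8 bytes, most significant byte first. */
static void put_double_be(unsigned char out[8], double v)
{
    uint64_t bits;
    int i;

    assert(sizeof(double) == sizeof(uint64_t));
    memcpy(&bits, &v, sizeof bits);        /* reinterpret without aliasing UB */
    for (i = 0; i < 8; i++)
        out[i] = (unsigned char)((bits >> (56 - 8 * i)) & 0xffu);
}

/* Inverse of put_double_be. */
static double get_double_be(const unsigned char in[8])
{
    uint64_t bits = 0;
    double v;
    int i;

    for (i = 0; i < 8; i++)
        bits = (bits << 8) | in[i];
    memcpy(&v, &bits, sizeof v);
    return v;
}
```

Because the bytes are produced with shifts rather than by copying memory directly, a file written on a little-endian machine reads back identically on a big-endian one.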
 
 
 
* Removed scan code from PyOpenCL because of licensing issues. Good thing it wasn't being used anywhere yet.
 
 
 
* Remove malloc inside framebuffer grabber routine.
 
* Require OpenCL 1.2 or greater.
 
* Change OCL primitive packing routines to use memory pools.
 
 
 
* Initial bot, ars implementation. It just intersects all the triangles. No acceleration.
 
* Removed the, now unused, one kernel call per ray-primitive intersection routines.
 
* HLBVH bot construction (experimental) and OCL traversal. Here's a screenshot:
 
<blockquote>
 
{|
 
!'''Buddha (OCL)'''
 
|-
 
|[[File:Cl_buddha.png|512px]]
 
|-
 
|align="center"|1 million triangles
 
|-
 
|align="center"|elapsed time @ 972x956: '''0.14 sec''' (OCL)
 
|-
 
|align="center"|elapsed time @ 972x956: 17.49 sec (RT)
 
|-
 
|align="center"|elapsed time @ 972x956: 0.49 sec (RT bot kd-tree)
 
|}
 
</blockquote>
 
: All math operations are done in double precision FP.
 
* Fix bugs in bot triangle data parsing.
 
* Add gamma correction and haze.
 
* Fix a bug in hlbvh construction in certain edge cases where the primitive bounding boxes are empty.
 
* Experimental bot triangle normal support.
 
* Phong shading lighting model.
 
* Handle UNORDERED, CW, and CCW triangle vertices to fix bot normal generation.
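The vertex ordering matters because the geometric normal of a triangle comes from a cross product whose sign depends on the winding. A hedged sketch in C follows; the enum names are hypothetical stand-ins for the bot orientation modes, and for unordered triangles it simply assumes the normal should face the incoming ray.

```c
#include <assert.h>

typedef struct { double x, y, z; } vec3;

enum winding { WIND_UNORDERED, WIND_CCW, WIND_CW };   /* hypothetical names */

static vec3 sub3(vec3 a, vec3 b)
{
    vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z };
    return r;
}

static vec3 cross3(vec3 a, vec3 b)
{
    vec3 r = { a.y * b.z - a.z * b.y,
               a.z * b.x - a.x * b.z,
               a.x * b.y - a.y * b.x };
    return r;
}

static double dot3(vec3 a, vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Unnormalized outward normal of triangle (a, b, c) for a given winding. */
static vec3 tri_normal(vec3 a, vec3 b, vec3 c, enum winding w, vec3 ray_dir)
{
    vec3 n = cross3(sub3(b, a), sub3(c, a));   /* right-hand rule: CCW */
    int flip = 0;

    assert(w == WIND_UNORDERED || w == WIND_CCW || w == WIND_CW);
    if (w == WIND_CW)
        flip = 1;                              /* opposite winding */
    else if (w == WIND_UNORDERED)
        flip = dot3(n, ray_dir) > 0.0;         /* make it face the ray */
    if (flip) {
        n.x = -n.x; n.y = -n.y; n.z = -n.z;
    }
    return n;
}
```

With per-vertex normals the same sign decision would apply before interpolating across the triangle.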
 
 
 
* Added material colors to OCL render. The colors are kind of buggy because there is no easy way, that I know of, of getting the actual material associated with a solid in the table. Materials belong to regions, not to solids, and any solid may be in a number of regions. Figuring out the material without consulting the actual CSG tree which has the regions is hence non-trivial.
 
 
 
* Added a lightmodel with transparent multi-hit rendering to show the multi-hit facilities.
 
<blockquote>
 
{|
 
!'''Goliath (OCL)'''
 
|-
 
|[[File:Cl_golliath.png|512px]]
 
|-
 
|align="center"|elapsed time @ 972x956: 0.33 sec
 
|}

</blockquote>
 
 
 
* Fix linking errors in AMD OCL SDK.
 
* Fix issues with OCL color render.
 
* Fix issue when doing a render with nothing on view.
 
* Set the local workgroup size when rendering to use subgrids up to 8x8 size to maximize coherency of accesses. This speeds things up about 2x.
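With a fixed local size, the global size passed to the launch has to be padded, because before OpenCL 2.0 <code>clEnqueueNDRangeKernel</code> requires the global work size to be a multiple of the local work size. The host-side arithmetic can be sketched as follows; <code>round_up</code> and <code>tile_sizes</code> are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of tile. */
static size_t round_up(size_t n, size_t tile)
{
    assert(tile > 0);
    return (n + tile - 1) / tile * tile;
}

/* Fill the 2D global and local work sizes for a width x height image
 * rendered in square tiles, so that neighbouring rays share a work-group. */
static void tile_sizes(size_t width, size_t height, size_t tile,
                       size_t global[2], size_t local[2])
{
    local[0] = local[1] = tile;
    global[0] = round_up(width, tile);
    global[1] = round_up(height, tile);
}
```

For the 972x956 renders in this log and 8x8 tiles this gives a 976x960 launch. The kernel then has to return early for work-items whose x or y falls outside the image.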
 
 
 
* Tested an adaptation of ''Understanding the Efficiency of Ray Traversal on GPUs. Timo Aila and Samuli Laine, Proc. High-Performance Graphics 2009.'' Was not significantly better on the GTX TITAN compared with just shooting rays in 8x8 blocks. You can read more about it here:
 
** https://sourceforge.net/p/brlcad/patches/416/
 
 
 
=Post Development Phase=
 
=== Week: 24-30 Aug ===
 
* Use less memory to store solid ids and materials.  Eliminate some more branches and simplify logic in solver.
 
* Compute transparency using attenuation.
 
 
 
* bool.c cleanups. If we are ever to port the standard BRL-CAD CSG evaluator algorithm to OpenCL C, given that there seem to be no other major viable options which give sufficiently correct results for our project's purposes, this code must be brought to heel. Such a task would be immense. I hope I helped with a series of patches that remove <code>goto</code> (not available in OpenCL C) and re-compile the bool trees (binary trees of pointers) to a linear postfix array form. This form is easier to parse and evaluate during the rendering stage. I did those tasks in these stages:
 
**eliminated all gotos in <code>rt_default_multioverlap()</code>.
 
**eliminated all gotos in <code>rt_boolweave()</code>.
 
**produced a patch to use the postfix linear tree. It uses a lot less memory (64 bits per node) and the traversal is more cache coherent. The CSG inference engine supports these operators: UNION, INTERSECT, DIFFERENCE, XOR, NOT, SOLID, NOP.
 
::It might require re-interfacing with db code in particular for the way XOR operations used to be treated. I reimplemented these functions to use the postfix bool tree:
 
::<code>rt_tree_max_raynum()</code>, <code>rt_tree_test_ready()</code>, <code>rt_booleval()</code>, <code>rt_solid_bitfinder()</code>.
 
::*https://sourceforge.net/p/brlcad/patches/417/
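The postfix form described above can be evaluated with a small stack and no recursion or pointers, which is what makes it attractive for OpenCL C. The sketch below is not the code from the patch: the split of the 64-bit node into a 32-bit opcode and a 32-bit solid index, and all identifiers, are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum bool_op { OP_NOP, OP_SOLID, OP_NOT,
               OP_UNION, OP_INTERSECT, OP_DIFFERENCE, OP_XOR };

typedef struct {
    uint32_t op;      /* one of enum bool_op */
    uint32_t solid;   /* solid index, only used by OP_SOLID */
} bool_node;          /* 64 bits per node */

/* Evaluate a postfix tree. inside[i] is nonzero when the point (or
 * segment) under test lies in solid i. Returns 1 if it is in the region. */
static int bool_eval(const bool_node *tree, size_t n,
                     const unsigned char *inside)
{
    int stack[64];
    int sp = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        int a, b;

        switch (tree[i].op) {
        case OP_NOP:
            break;
        case OP_SOLID:
            assert(sp < 64);
            stack[sp++] = inside[tree[i].solid] != 0;
            break;
        case OP_NOT:
            assert(sp >= 1);
            stack[sp - 1] = !stack[sp - 1];
            break;
        default:                      /* binary operators */
            assert(sp >= 2);
            b = stack[--sp];
            a = stack[--sp];
            if (tree[i].op == OP_UNION)           a = a || b;
            else if (tree[i].op == OP_INTERSECT)  a = a && b;
            else if (tree[i].op == OP_DIFFERENCE) a = a && !b;
            else { assert(tree[i].op == OP_XOR);  a = a != b; }
            stack[sp++] = a;
            break;
        }
    }
    assert(sp == 1);
    return stack[0];
}
```

For example, the region (s0 u s1) - s2 becomes the five-node array SOLID 0, SOLID 1, UNION, SOLID 2, DIFFERENCE, which is read front to back in one linear pass.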
 
 
 
* Process segments instead of hit points.  Use registers to store segments.  Make all available rendering modes (full, diffuse, normals, multi-hit transparent) work in a single pass. This speeds up the full and transparent modes like 2-3x.
 
* Also updated the multiple-kernel launch renderer code to work with the segment list approach. It might be slower than the single-kernel launch renderer but we might eventually need the whole segment list in memory at the same time to perform more advanced rendering.
 
* Fixed the ocl material colors. It seems a solid's basic material color is at the end rather than the beginning of its regions list...
 
 
 
 
 
* Well folks GSoC 2015 is finally over! Mission complete! I thank everyone who made this possible:
 
**Google: Carol Smith
 
**BRL-CAD: brlcad (Sean), Stragus, ``Erik, starseeker.
 
These were the most notable task supporters to list. The deepest thanks go to my parents for tirelessly supporting me during this code marathon.
 
