
PROJECT REPORT

(PROJECT SEMESTER TRAINING)

Object Oriented and Aspect Oriented Programming with CUDA

Submitted by

Ankita Dewan

Roll No. 101053004

Under the Guidance of

Dr. Ashutosh Mishra, Assistant Professor, Dept. of CSE, Thapar University, Patiala

Dr. Balwinder Sodhi, Assistant Professor, Dept. of CSE, IIT Ropar

Department of Computer Science and Engineering

THAPAR UNIVERSITY, PATIALA

Jan-May 2014

DECLARATION

I hereby declare that the project work entitled “Object Oriented and Aspect Oriented Programming with CUDA” is an authentic record of my own work, carried out at IIT Ropar as a requirement of the project semester for the award of the degree of B.E. (Computer Science & Engineering), Thapar University, Patiala, under the guidance of Dr. Ashutosh Mishra and Dr. Balwinder Sodhi, during 5th Jan to 28th May, 2014.

Ankita Dewan

101053004

Date: 30th May, 2014

Certified that the above statement made by the student is correct to the best of our knowledge and

belief.

Dr. Ashutosh Mishra, Assistant Professor, Dept. of CSE, Thapar University, Patiala

Dr. Balwinder Sodhi, Assistant Professor, Dept. of CSE, IIT Ropar

Acknowledgment

I take this opportunity to express my heartfelt gratitude to my mentor Dr. Balwinder Sodhi for his

constant support. His priceless suggestions, ideas and expertise helped me better the quality of my

project. He has been extremely supportive throughout the course of my internship for which I

express my deep and sincere gratitude.

I appreciate all the help and support given to me by my internship colleague Anusha Vangala from Siddhartha Institute of Technology, Vijayawada, Andhra Pradesh, and by everyone else in the Computer Science department who helped me make use of the numerous facilities.

My acknowledgement would be incomplete without thanking my parents for their constant love

and support and being there by my side through thick and thin.

Abstract

One of the primary aims of computer science is simplification and facilitation. There is a constant drive to introduce abstraction and/or virtualization so that the primitive building blocks of a technology are preserved while being packaged in a constructive and sophisticated manner. From there on, it becomes easier to add or modify features of the technology.

Performance is an indispensable requirement. In the context of processing and computation, parallel processing proves to be faster. Technologies like NVIDIA CUDA enable the user to send C/C++/Fortran code (depending on the technology) straight to the GPU, with no assembly language required.

So far, it is mostly papers and applications from academia and institutes such as CERN that have experimented with this technology and described how the performance of certain algorithms improves when they are implemented on CUDA. GPUs have long been targeted at games, but here again CUDA has not found commercial use. Many programmers are of the opinion that CUDA is not “elegant”: writing a "hello world" program in CUDA can be a day of struggle just to get things working. For someone who has less knowledge of these techniques, or who does not want to get into their details, simplification, facilitation and convenience must come into the picture.

Our project aims to simplify the manner in which CUDA is presently available by combining it with other techniques that can complement it without taking away its very essence.

Institute Profile

Indian Institute of Technology Ropar, established in 2008, is one of the eight new IITs set up by

the Ministry of Human Resource Development (MHRD), Government of India, to expand the

reach and enhance the quality of technical education in the country.

The institute is committed to providing state-of-the-art technical education in a variety of fields

and also for facilitating transmission of knowledge in keeping with latest developments.

At present, the institute offers Bachelor of Technology (B. Tech.) program in Computer Science

and Engineering, Electrical Engineering, and Mechanical Engineering.

The institute is keen to establish a Central Research Facility. The PhD program was started so that the research environment is further augmented, expanded, and made even more vibrant.

My internship under the Department of Computer Science and Engineering helped me appreciate the value of hands-on training and design, and I got to work with excellent facilities.

Nomenclature

GPU……………………………………….……..…………...................... Graphics processing unit

GPGPU………………………………………………….. General purpose graphics processing unit

CUDA……………….............................................................Compute Unified Device Architecture

JCuda………………………………………………………….………………………....Java CUDA

AOP……………………..……………………………………............Aspect-oriented programming

AJC……………………………………………………………………….………..AspectJ Compiler

SPMD…………………………………………………………………Single program, multiple data

ISA………………. …………………………………………………….Instruction Set Architecture

Table of Contents

Chapter 1 Introduction

1.1. Motivation………………………………………………………………………..1

1.2. Problem Statement……………………………………………………………….1

1.3. Work Plan…………………………….................................................................2

Chapter 2 Background

2.1 GPU and CUDA…………………………………………………………………3

2.2 JCuda…………………………………………………………………………….5

2.3 Aspect Oriented Programming and AspectJ……………………………………..6

Chapter 3 Body of Work

3.1 Design……………………………………………………………………………7

3.2 Implementation…………………………………………………………………..10

3.3 Procedure………………………………………………………………………..11

Chapter 4 Related Works

4.1 Alternate Technologies……………………………………………………………16

4.2 Past Projects……………………………………………….…………………….17

Chapter 5 Observation and Findings…………………………………..…………………………18

Chapter 6 Limitations …………………………………………………..………………………..19

Chapter 7 Future Work………………………………………………….……………………….20

Chapter 8 Conclusion……………………………………………………………………………..22

References……………………………………………………………………………..23

Table of figures

Fig 1 CPU is composed of only a few cores that can handle fewer threads at a time.

GPU is composed of many cores that handle thousands of threads simultaneously…….3

Fig 2 CUDA stages…………………………………………………………………………….5

Fig 3 Activity Diagram for computing heterogeneous programs………………………………7

Fig 4 Activity Diagram for CUDA program…………………………………………………...8

Fig 5 Entity Relationship Diagram for CPU, CUDA, JCuda and GPU……………………….9

Fig 6 CUDA Sample Screenshot………………………………………………………….…...11

Fig 7 JCuda Sample Screenshot_1…………………………………………………………….12

Fig 8 JCuda Sample Screenshot_2…………………………………………………………….13

Fig 9 AspectJ Sample Screenshot……………………………………………………………..14

Fig 10 JCuda and Aspectj Sample Screenshot………………………………………………….15


Chapter 1

Introduction

1.1 Motivation

The likelihood of a shift from traditional CPUs to parallel hybrid platforms, such as multi-core CPUs accelerated with heterogeneous GPU co-processing systems, is as high as it was when the hardware field switched over to multi-threading and multi-core CPUs.

Although this is largely a matter of hardware functionality, it does impact software entities and thus programmers: existing programs need to be modified so that they can be properly parallelized to reap the benefits of advanced processing architectures.

Nvidia invented CUDA (Compute Unified Device Architecture) as a parallel computing platform

and programming model to increase computing performance by harnessing the power of the

graphics processing unit (GPU). So far so good.

The next desirable move:-

Use technologies which, when combined with CUDA, can make it easier to use and cater to the needs of a larger domain of programmers/users. In technical terms, the idea is to abstract out the details of GPU computations.

1.2 Problem Statement

CUDA alone leads to tangled source code. Our source code is a combination of:-

- code for the core kernel computation on the device, and

- code for kernel management by the host, which additionally contains the code for data transfers between memory spaces and various optimizations.

In our project, we work on a programming system based on the principles of Object-Oriented and Aspect-Oriented Programming. The motive is to un-clutter the code and thereby improve programmability.


1.3 Work Plan

CUDA source code is written entirely in C; that is, both the host code and the device code are written in C.

Our approach:-

The Object-Oriented language, Java bindings (JCuda) in this case, is used to handle the host

computations.

The Aspect-Oriented language, AspectJ in this case, is used to encapsulate all other support

functions, such as parallelization granularity and memory access optimization.

The kernel code remains in C as the device code.

In the last stage, aspect compiler (ajc) is used to combine the core Object Oriented program with

aspects to generate parallelized programs.


Chapter 2

Background

2.1 GPU and CUDA

The CPU and the GPU are designed differently: while the CPU has latency-oriented cores, the GPU has throughput-oriented cores. The CPU is essentially “the master” throughout the operation. It has powerful ALUs with reduced operation latency, large caches that convert long-latency memory accesses into short-latency cache accesses, and a sophisticated control system. The GPU, on the other hand, has small caches that boost memory throughput, a larger number of energy-efficient ALUs that are heavily pipelined for high throughput, and a simpler control system.

Fig 1: A CPU is composed of only a few cores that can handle fewer threads at a time, whereas a GPU is composed of many cores that handle thousands of threads simultaneously.

This field is essentially a part of parallel computing but is heterogeneous in nature as it deals with

serial parts and parallel parts.

CUDA achieves parallelism through the SPMD model. A thread is a virtualized processor; it follows the instruction cycle (fetch, decode, execute). For fast execution of arrays of parallel threads, a CUDA kernel is used such that all threads in a grid run the same kernel code.

Each thread has indices that decide what data to work on and that drive control decisions:

i = blockIdx.x * blockDim.x + threadIdx.x

A thread array is divided into multiple blocks, which have access to shared memory.
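As an illustration, the hypothetical vector-add kernel below applies exactly this indexing scheme. Since the host code in this project is ultimately written in Java (Sections 2.2 and 3.3.4), the kernel's C source is shown the way JCuda later accepts it, kept as a plain Java string:

public class AddKernelSource {
    // CUDA C source of the kernel, kept as a Java string so the JCuda host code can pass it on.
    public static final String ADD_KERNEL =
        "extern \"C\" __global__ void add(int n, float *a, float *b, float *c)      \n" +
        "{                                                                           \n" +
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element\n" +
        "    if (i < n) c[i] = a[i] + b[i];                  // guard threads past the end\n" +
        "}                                                                           \n";
}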


The two types of memory in the CPU-GPU architecture are global memory and shared memory. The device can read and write both shared and global memory, while the host can transfer data to and from global memory. The contents of global memory are visible to all the threads of the grid: any thread can read and write any location of global memory. Shared memory is separate for each block of the grid; any thread of a block can read and write the shared memory of that block, but a thread in one block cannot access the shared memory of another block. Shared memory is faster to access than global memory.

A typical CUDA program includes the following steps:-

1. Device memory allocation for the input and output entities, using cudaMalloc(), which allocates an object in device global memory and requires two parameters:

- the address of a pointer to the allocated object, and

- the size of the allocated object in bytes.

2. Copying from host memory to device memory, using cudaMemcpy(), which requires four parameters:-

- pointer to the destination

- pointer to the source

- number of bytes to copy

- type/direction of the transfer (cudaMemcpyHostToDevice)

3. Launching the kernel code from the host:

KernelName<<<dimGrid, dimBlock>>>(m, n, k, d_A, d_B, d_C);

4. Copying the output entity back from device memory to host memory, using cudaMemcpy() with the direction of transfer set to cudaMemcpyDeviceToHost.

5. Freeing device memory, using cudaFree(), which frees the object from device global memory.
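For illustration, a minimal sketch of how these five steps look when driven from Java through the JCuda bindings introduced in Section 2.2 (class and variable names are hypothetical; the kernel launch of step 3 goes through the driver API or the KernelLauncher utility of Section 3.3.4 rather than the <<<...>>> syntax):

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.runtime.JCuda;
import jcuda.runtime.cudaMemcpyKind;

public class CudaSteps {
    public static void main(String[] args) {
        int n = 1024;
        float[] hostIn = new float[n];
        float[] hostOut = new float[n];

        // 1. Device memory allocation for the input and output entities
        Pointer devIn = new Pointer();
        Pointer devOut = new Pointer();
        JCuda.cudaMalloc(devIn, n * Sizeof.FLOAT);
        JCuda.cudaMalloc(devOut, n * Sizeof.FLOAT);

        // 2. Copy from host memory to device memory
        JCuda.cudaMemcpy(devIn, Pointer.to(hostIn), n * Sizeof.FLOAT,
                cudaMemcpyKind.cudaMemcpyHostToDevice);

        // 3. Kernel launch: done through the driver API (cuLaunchKernel) or the
        //    KernelLauncher utility of Section 3.3.4, not shown here.

        // 4. Copy the output entity back from device memory to host memory
        JCuda.cudaMemcpy(Pointer.to(hostOut), devOut, n * Sizeof.FLOAT,
                cudaMemcpyKind.cudaMemcpyDeviceToHost);

        // 5. Free device memory
        JCuda.cudaFree(devIn);
        JCuda.cudaFree(devOut);
    }
}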

CUDA Function Declarations:-

• __global__ defines a kernel function; it must return void and is callable from the host.

• __device__ defines a device function; it may have any return type and is callable only from the device.

• __host__ defines a host function with any return type, callable only from the host.


A few ongoing CUDA projects are CudaRasterization, Monte Carlo with CUDA, MD5 Hash Crack in CUDA, Cloth Simulation, and Image Tracking.

Fig 2: CUDA stages

2.2 JCuda

Java is one of the most commercially used programming languages and is preferred by programmers of all backgrounds. It is class-based and object-oriented, and it gives programmers and developers the option to "write once, run anywhere" (WORA).

Thus, it became a favourable option to bind Java to a library that acts as an application programming interface (API) and also provides the basic code to use that library for CUDA. JCuda provides all essential Java bindings for the CUDA runtime and driver API, and it acts as the base for all other libraries like JCublas, JCufft etc.

JCuda lets the host interact with a CUDA device. It provides methods which expose the basic steps of CUDA in a sequential manner; these methods cover device management, event management, memory allocation on the device, and copying memory between the device and the host system.
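For instance, a minimal device-management sketch with the JCuda driver bindings (assuming the first CUDA device in the system) looks as follows:

import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.JCudaDriver;

public class DeviceSetup {
    public static void main(String[] args) {
        JCudaDriver.setExceptionsEnabled(true);       // report CUDA errors as Java exceptions
        JCudaDriver.cuInit(0);                        // initialize the driver API
        CUdevice device = new CUdevice();
        JCudaDriver.cuDeviceGet(device, 0);           // pick device 0
        CUcontext context = new CUcontext();
        JCudaDriver.cuCtxCreate(context, 0, device);  // create a context to work in
    }
}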


2.3 Aspect Oriented Programming and AspectJ

Aspect-oriented programming aims to modularize crosscutting concerns¹, just as object-oriented programming modularizes common concerns. It aims to deal with two issues:- scattered code and tangled code.

The power of OOP diminishes beyond encapsulation, abstraction, polymorphism etc. This is where AOP addresses the problems by using more manageable modules, called aspects. Unlike some earlier shifts, AOP does not replace previous programming paradigms; rather, it is complementary to the object-oriented paradigm, not a replacement for it.

AspectJ is an implementation of AOP for Java. It adds the following concepts to Java.

Join Point: a well-defined point in the program flow.

Pointcut: a construct to select certain join points and values at those points.

- call: identifies any call to the methods defined by an object.

- cflow: identifies join points based on whether they occur in the dynamic context of another pointcut.

- execution: when a particular method body executes.

- target: when the target object is of some parameter type.

- this: when the currently executing object (i.e. this) is of some parameter type.

- within: when the executing code belongs to the class.

Advice: Defines code that is executed when a point cut is reached; dynamic parts of AspectJ.

- Before: Runs when a join point is reached and before the computation proceeds, i.e. it runs

when computation reaches the method call and before the actual method starts running.

- After: Runs after the computation 'under the join point' finishes, i.e. after the method body has

run, and just before control is returned to the caller.

- Around: Runs when the join point is reached, and has explicit control over whether the

computation under the join point is allowed to run at all.

Introduction: Modifies a program's static structure, namely, the members of its classes and the

relationship between classes.

Aspect: AspectJ's unit of modularity for crosscutting concerns; defined in terms of point cuts,

advice and introduction.

¹ Logging, authorization, synchronization, error handling and transaction management exemplify crosscutting concerns because such strategies necessarily affect more than one part of the system. Logging, for instance, crosscuts all logged classes and methods.
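As a small illustration of these concepts, the following hypothetical aspect uses a pointcut and around advice to time the execution of a method named launchKernel (the method name is an assumption for this example). It is written in AspectJ's annotation style, which is plain Java syntax accepted by ajc:

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

@Aspect
public class KernelTiming {

    // Pointcut: executions of any method called launchKernel, in any class, with any arguments.
    @Around("execution(* *.launchKernel(..))")
    public Object time(ProceedingJoinPoint joinPoint) throws Throwable {
        long start = System.nanoTime();
        Object result = joinPoint.proceed();   // run the computation under the join point
        long elapsed = System.nanoTime() - start;
        System.out.println(joinPoint.getSignature() + " took " + elapsed + " ns");
        return result;
    }
}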


Chapter 3

Body of work

3.1 Design

Fig 3: Activity diagram for computing heterogeneous programs


Fig 4: Activity diagram for a CUDA program


Fig 5: Entity-relationship diagram for CPU, CUDA, JCuda and GPU


3.2 Implementation

We took a bottom-up approach. The focus started with CUDA and then shifted to JCuda and AspectJ separately, followed by interlinking JCuda and AspectJ.

CUDA

To begin with, a CUDA (v3.1 and beyond) implementation requires a CUDA-enabled Nvidia GPU card. Our personal machines do not have one, and a CPU alone cannot perform the computations. Earlier versions did support device emulation, but that compilation path can be quite error prone, apart from the poor performance. Hence, we relied upon our mentor's machine, which has a GeForce GT 620 card with the following features:

o Global memory - 1022 Mbytes

o 1 Multiprocessor (MP)

o 48 CUDA Cores/ MP

o GPU Clock rate - 1620 MHz

o Total amount of shared memory per block - 49152 bytes

o Maximum number of threads per multiprocessor – 1536

o Maximum number of threads per block – 1024
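Figures of this kind can also be queried programmatically. A minimal sketch using JCuda's runtime bindings, assuming device 0 is the GT 620 (the field names mirror CUDA's cudaDeviceProp structure):

import jcuda.runtime.JCuda;
import jcuda.runtime.cudaDeviceProp;

public class DeviceQuery {
    public static void main(String[] args) {
        cudaDeviceProp properties = new cudaDeviceProp();
        JCuda.cudaGetDeviceProperties(properties, 0);  // query device 0
        System.out.println("Global memory (bytes):           " + properties.totalGlobalMem);
        System.out.println("Multiprocessors:                 " + properties.multiProcessorCount);
        System.out.println("Shared memory per block (bytes): " + properties.sharedMemPerBlock);
        System.out.println("Max threads per block:           " + properties.maxThreadsPerBlock);
        System.out.println("Max threads per multiprocessor:  " + properties.maxThreadsPerMultiProcessor);
    }
}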

Operating Environment: Ubuntu 12.10 32-bit installed as guest OS on VMware Player 5.0.2 for

local computations and also to SSH to the remote machine (Ubuntu 12.04 64-bit OS) with the GPU

card.


3.3 Procedure

3.3.1 CUDA

Fig 6: CUDA sample screenshot

CUDA toolkit (v5.5) is downloaded and installed using terminal commands.

The PATH and LD_LIBRARY_PATH environment variables are set for CUDA development.

A CUDA program needs to have the .cu file (with the host and device code) and 3 configuration

files (findcudalib.mk, NsightEclipse.xml and MakeFile) placed in the same folder/directory. The

MakeFile must contain the concerned .cu file name.

Upon using the ‘make’ command, a .o object file and an executable file are created. The resulting executable can then be run on the GPU-equipped machine.


3.3.2 JCuda

Fig 7: JCuda sample screenshot 1

JCuda (v0.5.5) libraries have been compiled for CUDA 5.5. We used the binaries for Linux 64-bit, which contain the JAR files and shared objects (.so files) of all the libraries; jcuda-0.5.5.jar is the one mostly used for compiling and running JCuda applications.

For a minimal JCuda program “jcuda.java” without CUDA kernel code:

Compilation:- creates the "jcuda.class" file.

Execution:- prints the information about the pointer created in the program.
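A sketch of such a minimal program, along the lines of the JCuda starter sample (the class name is illustrative):

import jcuda.Pointer;
import jcuda.runtime.JCuda;

public class JCudaMinimal {
    public static void main(String[] args) {
        Pointer pointer = new Pointer();
        JCuda.cudaMalloc(pointer, 4);              // allocate 4 bytes of device memory
        System.out.println("Pointer: " + pointer); // prints the information about the pointer
        JCuda.cudaFree(pointer);                   // release the device memory
    }
}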


Fig 8: JCuda sample screenshot 2

For a full-fledged JCuda program “Add.java” with separate CUDA kernel code “AddK.cu” (handled manually):

Compilation:- The kernel code is written exactly in the same way as it is for plain CUDA, and it has to be identified and accessed by specifying its name in the source code. It is compiled by the NVCC compiler to create either a PTX² file or a CUBIN³ file that can be loaded and executed using the Driver API.

Loading and execution:- The PTX/CUBIN file has to be loaded, and a pointer to the kernel function has to be obtained.

² A human-readable (but hardly human-understandable) file containing a specific form of "assembler" source code.

³ A "CUDA binary", containing compiled code that can be loaded and executed directly by a specific GPU. CUBIN files are specific to the Compute Capability of the GPU; thus, the latest samples prefer the use of PTX files, since they are compiled at runtime for the GPU of the target machine.
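A minimal sketch of this loading step with the JCuda driver bindings, assuming the kernel function inside AddK.cu is named add and that nvcc produced AddK.ptx:

import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import jcuda.driver.JCudaDriver;

public class KernelLoading {
    public static void main(String[] args) {
        // Driver setup (as in the device-management sketch of Section 2.2).
        JCudaDriver.cuInit(0);
        CUdevice device = new CUdevice();
        JCudaDriver.cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        JCudaDriver.cuCtxCreate(context, 0, device);

        // Load the PTX compiled from AddK.cu and obtain a pointer to its "add" kernel.
        CUmodule module = new CUmodule();
        JCudaDriver.cuModuleLoad(module, "AddK.ptx");
        CUfunction addFunction = new CUfunction();
        JCudaDriver.cuModuleGetFunction(addFunction, module, "add");
        // The kernel can then be launched with JCudaDriver.cuLaunchKernel(addFunction, ...).
    }
}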


3.3.3 AspectJ

Fig 9: AspectJ sample screenshot

Compilation:- The .java and .aj files are listed in a .lst file and the -argfile option is used with ajc.

Execution:- To run the program, aspectjrt.jar is included in the classpath and the java command is used.


3.3.4 JCuda and AspectJ

Fig 10: JCuda and AspectJ sample screenshot

To assess the feasibility of AspectJ being compatible with JCuda, the JCuda utility classes JAR archive was also downloaded. The archive jcudaUtils-0.0.4.jar contains the "KernelLauncher" class, which simplifies the setup and launching of kernels using the JCuda Driver API. It creates PTX files from inlined source code that is given as a String, or from existing CUDA source files. PTX or CUBIN files can be loaded, and the kernels can be called more conveniently thanks to the automatic setup of the kernel arguments.

Compilation:- Again, the .java, .cu and .aj files are listed in a .lst file and the -argfile option is used with the ajc command. The source and target are specified, along with the classpath containing jcuda.jar as well as aspectjrt.jar.

Execution:- To run the program, aspectjrt.jar and jcuda.jar are included in the classpath and the java command is used.
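A minimal sketch of the kind of interweaving this enables: a hypothetical aspect that pulls the device-management concern out of the JCuda host class (assumed here to be a class Add with a compute() method), so that the core object-oriented program expresses only the computation. The aspect is compiled and woven with ajc as described above:

import jcuda.runtime.JCuda;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

@Aspect
public class DeviceManagement {

    // Wrap the core computation with the device-preparation and clean-up steps.
    @Around("execution(* Add.compute(..))")
    public Object manageDevice(ProceedingJoinPoint joinPoint) throws Throwable {
        JCuda.setExceptionsEnabled(true);  // report CUDA error codes as Java exceptions
        JCuda.cudaSetDevice(0);            // select the first CUDA device
        try {
            return joinPoint.proceed();    // run the core JCuda computation untouched
        } finally {
            JCuda.cudaDeviceReset();       // release the device once the computation is done
        }
    }
}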


Chapter 4

Related Work

4.1 Alternate technologies

4.1.1 Open Computing Language (OpenCL)

- Another framework for writing programs that execute across heterogeneous platforms

consisting of CPUs, GPUs and other processors

- Consists of a language for writing kernels and APIs to define and control the platforms; a very primitive tool.

- CUDA is limited to Nvidia hardware and is tied directly to the execution platform, whereas OpenCL is portable.

- CUDA excels over OpenCL in that it outperforms OpenCL when code is natively ported to each.

- CUDA has more mature tools, such as a debugger, a profiler, CUBLAS and CUFFT.

4.1.2 Aparapi

- An AMD product.

- Converts Java bytecode to OpenCL at runtime and executes either on the GPU or in Java

thread pool.

4.1.3 Rootbeer

- A GPU compiler for Java that generates CUDA code; positioned as an alternative to writing kernels for nvcc by hand.

4.1.4 Java Annotations

- Introduced in JDK 1.5; organized data about the code, embedded within the code itself.

- Options (illustrated in the sketch after this list):-

@Before – run before the method execution

@After – run after the method has returned a result

@AfterReturning – run after the method has returned a result, and intercept the returned result as well

@AfterThrowing – run after the method throws an exception

@Around – run around the method execution; combines the advices above


- Simpler to use than AspectJ, as they do not need load-time weaving or a separate compiler, whereas AspectJ needs ajc.

- AspectJ supports all pointcuts. It is a more flexible approach and there is little runtime overhead. With annotations, one can only use the method-execution pointcut and there is more runtime overhead.
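A brief sketch of how the annotation options listed above are used, again with hypothetical class and method names (the annotations here are the AspectJ ones from org.aspectj.lang.annotation, compiled by ajc):

import org.aspectj.lang.JoinPoint;
import org.aspectj.lang.annotation.AfterThrowing;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Before;

@Aspect
public class KernelCallLogging {

    // Runs before every execution of Add.compute(..)
    @Before("execution(* Add.compute(..))")
    public void logStart(JoinPoint joinPoint) {
        System.out.println("Starting " + joinPoint.getSignature());
    }

    // Runs only if Add.compute(..) throws, and intercepts the exception
    @AfterThrowing(pointcut = "execution(* Add.compute(..))", throwing = "error")
    public void logFailure(JoinPoint joinPoint, Throwable error) {
        System.out.println(joinPoint.getSignature() + " failed: " + error.getMessage());
    }
}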

4.2 Past Projects

Project Sumatra

- An OpenJDK-backed project

- Primary goal: to enable Java applications to take advantage of graphics processing units (GPUs) and accelerated processing units (APUs), whether they are discrete devices or integrated with a CPU, in order to improve performance.

- Approach: software developers annotate their code to indicate which parts are suited to the parallel nature of GPUs. When the Java application is run on a system with an OpenCL-compatible GPU installed, the HotSpot JIT (just-in-time) compiler translates the annotated bits of code to OpenCL for processing on the GPU rather than the CPU.

- Technical Challenges Solved: Java allows developers to write once and deploy everywhere, hence its widespread use, but one area where it can fall flat is performance; generally, Java applications cannot perform as well as native applications written for a specific OS.

- Remaining Technical Challenges:

- mitigate the complexities of the present-day GPU backend and layered standards

- build compromise data schemes for both the JVM and GPU hardware

- support flatter data structures (complex values, vectors, 2D arrays)

- support a mix of primitives and JVM-managed pointers

- reduce data copying and inter-phase latency between ISA and loop kernels

- apply existing MapReduce technology (to JVM execution of GPU code)

- interpret the thread-based Java concurrency model


Chapter 5

Observation and findings

The two sets of code, split between JCuda and AspectJ, perform as well as the original host code written entirely in JCuda does, and there is not much overhead. The interweaving of code is possible. Thus, the program gets simplified for generic purposes and for anyone who wants to bypass the device-preparation steps.

The device code, however, continues to be a separate entity written in C. At least with the aspect-oriented paradigm, it could not be modified into a more easily accessible or more readily usable form. Thus, the kernel computation continues to depend on the CUDA/JCuda host segment.


Chapter 6

Limitations

Our project depends on the availability of, and access to, a CUDA-enabled GeForce, Tesla or Quadro GPU, either on the local machine or on some remote machine; otherwise, implementation or demonstration of any sort is not possible.

Even when availability and access are assured, the hardware configurations and compatibilities are quite specific. The compute capability and the version⁴ of the CUDA driver API play a crucial role. Further, the Driver API is only backward compatible, and hence mixing and matching versions will fail to execute. Environment variables need to be accurate for every tool/technique.

Obtaining output is not straightforward. The in-kernel printf() works like the printf() of traditional C. It is executed like other device-side functions, i.e. in a per-thread manner; in a multi-threaded kernel, printf() will therefore be executed by every thread, using that thread's data.

The problem arises from the fact that the final formatting of the printf() output has to take place on the host, so the format string must be understood by the host system's compiler and C library. Although efforts have been made so that the format specifiers supported by CUDA's printf() form a universal subset of those of the most common host compilers, the exact behavior is always host-OS-dependent.
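For example, a kernel like the following hypothetical one (shown, as elsewhere in this report, as the C source a JCuda host program would carry as a string) executes its printf() once per thread, and every resulting line is formatted on the host:

public class HelloKernelSource {
    // CUDA C source of a printing kernel, kept as a Java string for the JCuda host code.
    public static final String HELLO_KERNEL =
        "extern \"C\" __global__ void hello()                                          \n" +
        "{                                                                             \n" +
        "    printf(\"Hello from thread %d of block %d\\n\", threadIdx.x, blockIdx.x); \n" +
        "}                                                                             \n";
}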

⁴ Version compatibility rules:

o All applications, plug-ins, and libraries on a system must use the same version of the CUDA driver API, since only one version of the CUDA device driver can be installed on a system.

o All plug-ins and libraries used by an application must use the same version of the runtime.

o All plug-ins and libraries used by an application must use the same version of any libraries that use the runtime (such as CUFFT, CUBLAS).


Chapter 7

Future Work

Parallel programming models are the need of the hour, but they tend to have a somewhat unpredictable shelf-life. Because the hardware platforms underneath them change so rapidly with the trends, it becomes tough to speculate on the precise future of CUDA as it looks today. Nevertheless, much research is being done in this field.

It is an era where almost every technology and every idea has found or is finding its way to Cloud

Computing. With Internet access availability in the bigger picture, anything that has to do with

data storage, manipulation and computation can eventually become a part of "dynamic web".

So when the checklist covers abstraction, simplification, optimization and so on, scalability and availability are the features that might bring CUDA more commercial success.

A cloud-based machine with GPU is as good as a local or remote machine with GPU. Hadoop, a

widely-used MapReduce framework, has already been combined with AMD Aparapi. On similar

lines, the scope of on-going/future CUDA projects can be to have an easy-to-use API which allows

easy implementation of MapReduce algorithms that make use of the GPU. Abstraction can again

be a part of this combination as the API can serve dual purpose of hiding the complexity of GPU

programming and leveraging the numerous benefits of Cloud.

Thus, beyond single-GPU development, efforts in this direction can be extended to the domain of

GPU clusters. For instance the project GPMR has taken up this idea in its body of work.

Synopsis:

MapReduce is the toolset deployed for large-dataset processing. As with a regular MapReduce

model, the data-parallel processing is handled by GPUs.

The existing GPU-MapReduce (GPMR library) work targets solo GPUs. Unlike CPUs, GPUs

cannot source or sink network or I/O sources.


Scope and Possible Implementation:

- Specific extensions for the GPU, including batching Maps and Reduces via Chunking to

maintain GPU utilization

- Adding accumulation to the Map sub stage

- Adding a Partial Reduction sub stage

- Assembling the MapReduce pipeline to achieve a high overlap of communication and

computation.

Areas of concern:

- Programming multi-GPU clusters lacks powerful toolsets and APIs.

- GPU is often treated as a slave device in most GPU-computing applications.

- GPMR is stand-alone and does not sit atop Hadoop or another MapReduce package. It does

not handle fault tolerance. It does not provide a distributed file system (Hadoop Distributed

File System to be precise).


Chapter 8

Conclusion

The primary aim of the project, which was to assess the feasibility of breaking the existing code down into two entities and still getting accurate results, has been served well. The primitive idea of achieving parallelism with CUDA has now matured into a more sophisticated one: paradigms like Object-Oriented Programming and Aspect-Oriented Programming have graciously complemented CUDA without diminishing the power of this technology.

Just as JCuda has brought commercial reach to CUDA, products like PyCuda have done the same by offering the flavors of other paradigms, such as a multi-paradigm approach encompassing object-oriented, imperative, procedural and reflective programming. FORTRAN CUDA, CUDA.NET, KappaCUDA: examples abound.

The list of programming paradigms, compiling/weaving tools, cloud-computing techniques and other existing techniques is extensive. Further, CUDA is not the only technology in the parallel computing race. Thus, to conform to software quality metrics and to be certified as ‘fit for purpose’, any technique, in its full-fledged form, will have to undergo experimentation. Every permutation and combination will contribute to this field.


References

Websites: -

1. Official NVIDIA CUDA Home Page

http://www.nvidia.in/object/cuda_home_new.html

2. Official Eclipse AspectJ Home Page

https://www.eclipse.org/aspectj/doc/next/devguide/ajc-ref.html

3. Official JCuda Home Page

http://www.jcuda.org/tutorial/TutorialIndex.html

Journals/Research Papers: -

1. Aspect-Oriented Programming Beyond Dependency Injection, by Shigeru Chiba and Rei Ishikawa, Dept. of Mathematical and Computing Sciences, Tokyo Institute of Technology (2008)

2. JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA By

Yonghong Yan, Max Grossman, and Vivek Sarkar, Department of Computer Science, Rice

University (2009)

3. MITHRA: Multiple data Independent Tasks on a Heterogeneous Resource Architecture By

Reza Farivar, Abhishek Verma, Ellick M. Chan, Roy H. Campbell, Department of

Computer Science, University of Illinois at Urbana-Champaign 201 N Goodwin Ave,

Urbana, IL 61801-2302.

4. Tangling and scattering By Gregor Kiczales, John Lamping, Anurag Mendhekar, Chris

Maeda, Cristina Lopes, Jean-Marc Loingtier and John Irwin, Xerox Palo Alto Research

Center, 3333 Coyote Hill Road, Palo Alto, CA 94304, USA.