diff --git a/.github/workflows/github-pages.yml b/.github/workflows/github-pages.yml index f5d4a3212e..135a39cefe 100644 --- a/.github/workflows/github-pages.yml +++ b/.github/workflows/github-pages.yml @@ -8,12 +8,18 @@ on: workflow_dispatch: # schedule: -# - cron: '55 13 * * *' +# - cron: '55 13 * * *' + +# Top-level default; empty/no permissions +permissions: {} jobs: pages: name: Build GitHub Pages - runs-on: ubuntu-latest + runs-on: ubuntu-latest + permissions: + contents: write + # Above required for publishing to gh-pages; see auth on Ln 67-68 steps: - name: Set up Python diff --git a/AI-and-Analytics/Features-and-Functionality/IntelPyTorch_GPU_InferenceOptimization_with_AMP/sample.json b/AI-and-Analytics/Features-and-Functionality/IntelPyTorch_GPU_InferenceOptimization_with_AMP/sample.json index bedd0f0099..e01073f174 100644 --- a/AI-and-Analytics/Features-and-Functionality/IntelPyTorch_GPU_InferenceOptimization_with_AMP/sample.json +++ b/AI-and-Analytics/Features-and-Functionality/IntelPyTorch_GPU_InferenceOptimization_with_AMP/sample.json @@ -18,8 +18,7 @@ "pip install -r requirements.txt", "pip install jupyter ipykernel", "python -m ipykernel install --user --name=pytorch-gpu", - "python IntelPyTorch_GPU_InferenceOptimization_with_AMP.py", - "jupyter nbconvert --to notebook --execute IntelPyTorch_GPU_InferenceOptimization_with_AMP.ipynb" + "python IntelPyTorch_GPU_InferenceOptimization_with_AMP.py" ] } ] diff --git a/AI-and-Analytics/Features-and-Functionality/IntelPython_Numpy_Numba_dpex_kNN/IntelPython_Numpy_Numba_dpex_kNN.ipynb b/AI-and-Analytics/Features-and-Functionality/IntelPython_Numpy_Numba_dpex_kNN/IntelPython_Numpy_Numba_dpex_kNN.ipynb index a9505721f4..23cb746893 100644 --- a/AI-and-Analytics/Features-and-Functionality/IntelPython_Numpy_Numba_dpex_kNN/IntelPython_Numpy_Numba_dpex_kNN.ipynb +++ b/AI-and-Analytics/Features-and-Functionality/IntelPython_Numpy_Numba_dpex_kNN/IntelPython_Numpy_Numba_dpex_kNN.ipynb @@ -8,7 +8,7 @@ "source": [ "# 
=============================================================\n", "# Copyright © 2022 Intel Corporation\n", - "# \n", + "#\n", "# SPDX-License-Identifier: MIT\n", "# =============================================================" ] }, { @@ -19,9 +19,9 @@ "source": [ "# Simple k-NN classification with numba_dpex IDP optimization\n", "\n", - "This sample shows how to recieve the same accuracy of the k-NN model classification by using numpy, numba and numba_dpex. The computetaion are performed using wine dataset.\n", + "This sample shows how to achieve the same accuracy of k-NN model classification using numpy, numba and numba_dpex. The computations are performed on the wine dataset.\n", "\n", - "Let's start with general inports used in the whole sample." + "Let's start with the general imports used in the whole sample." ] }, { @@ -58,12 +58,12 @@ "from sklearn.datasets import load_wine\n", "\n", "data = load_wine()\n", - "# Convert loaded dataset to DataFrame \n", + "# Convert loaded dataset to DataFrame\n", "df = pd.DataFrame(data=data.data, columns=data.feature_names)\n", - "df['target'] = pd.Series(data.target)\n", + "df[\"target\"] = pd.Series(data.target)\n", "\n", - "# Limit features to 2 selected for this problem \n", - "df = df[['target', 'alcohol', 'malic_acid']]\n", + "# Limit features to 2 selected for this problem\n", + "df = df[[\"target\", \"alcohol\", \"malic_acid\"]]\n", "\n", "# Show top 5 values from the limited dataset\n", "df.head()" @@ -89,7 +89,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The next step is to prepare the dataset for training and testing. To do this, we randomly divided the downloaded wine dataset into a training set (containing 90% of the data) and a test set (containing 10% of the data). \n", + "The next step is to prepare the dataset for training and testing. 
To do this, we randomly divided the downloaded wine dataset into a training set (containing 90% of the data) and a test set (containing 10% of the data).\n", "\n", "In addition, we take from both sets (training and test) data *X* (features) and label *y* (target)." ] }, { @@ -103,12 +103,14 @@ "outputs": [], "source": [ "# we are using 10% of the data for the testing purpose\n", - "train_sample_idx = np.random.choice(df.index, size=int(df.shape[0]*0.9), replace=False)\n", + "train_sample_idx = np.random.choice(\n", + "    df.index, size=int(df.shape[0] * 0.9), replace=False\n", + ")\n", "train_data, test_data = df.iloc[train_sample_idx], df.drop(train_sample_idx)\n", "\n", "# get features and label from train/test data\n", - "X_train, y_train = train_data.drop('target', axis=1), train_data['target']\n", - "X_test, y_test = test_data.drop('target', axis=1), test_data['target']" + "X_train, y_train = train_data.drop(\"target\", axis=1), train_data[\"target\"]\n", + "X_test, y_test = test_data.drop(\"target\", axis=1), test_data[\"target\"]" ] }, { @@ -117,9 +119,9 @@ "source": [ "## NumPy k-NN\n", "\n", - "Now, it's time to implenet the first version of k-NN function using NumPy.\n", + "Now, it's time to implement the first version of the k-NN function using NumPy.\n", "\n", - "First, let's create simple euqlidesian distance function. We are taking positions form the provided vectors, counting the squares of the individual differences between the positions, and then drawing the root of their sum for the whole vectors (remember that the vectors must be of equal length)." + "First, let's create a simple Euclidean distance function. We take the corresponding positions from the provided vectors, square the differences between them, and then take the square root of their sum over the whole vectors (remember that the vectors must be of equal length)." 
] }, { @@ -129,7 +131,7 @@ "outputs": [], "source": [ "def distance(vector1, vector2):\n", - " dist = [(a - b)**2 for a, b in zip(vector1, vector2)]\n", + " dist = [(a - b) ** 2 for a, b in zip(vector1, vector2)]\n", " dist = math.sqrt(sum(dist))\n", " return dist" ] @@ -138,14 +140,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Then, the k-nearest neighbour algorithm itself.\n", + "Then, the k-nearest neighbors algorithm itself.\n", "\n", "1. We are starting by defining a container for predictions the same size as a test set.\n", "2. Then, for each row in the test set, we calculate distances between then and every training record.\n", "3. We are sorting training datasets based on calculated distances\n", "4. Choose k of the first elements in the sorted training list.\n", "5. We are counting labels appearances\n", - "6. The most common label is set as a prediction. " + "6. The most common label is set as a prediction." ] }, { @@ -158,26 +160,26 @@ " # 1. Prepare container for predictions\n", " predictions = np.zeros(X_test.shape[0])\n", " X_test.reset_index(drop=True, inplace=True)\n", - " \n", + "\n", " for index, row in X_test.iterrows():\n", " # 2. Calculate distances\n", " inputs = X_train.copy()\n", - " inputs['distance'] = inputs.apply(distance, vector2=row, axis=1)\n", + " inputs[\"distance\"] = inputs.apply(distance, vector2=row, axis=1)\n", " inputs = pd.concat([inputs, y_train], axis=1)\n", - " \n", + "\n", " # 3. Sort based on distance\n", - " inputs = inputs.sort_values('distance', ascending=True)\n", - " \n", + " inputs = inputs.sort_values(\"distance\", ascending=True)\n", + "\n", " # 4. Choose k if the first elements\n", " neighbors = inputs.head(k)\n", - " classes = neighbors['target'].tolist()\n", - " \n", + " classes = neighbors[\"target\"].tolist()\n", + "\n", " # 5. Count labels appearances\n", " majority_count = Counter(classes)\n", - " \n", + "\n", " # 6. 
Choose most common label\n", " predictions[index] = majority_count.most_common(1).pop()[0]\n", - " \n", + "\n", " return predictions" ] }, @@ -199,7 +201,7 @@ "predictions = knn(X_train, y_train, X_test, 3)\n", "true_values = y_test.to_numpy()\n", "accuracy = np.mean(predictions == true_values)\n", - "print('Numpy accuracy:', accuracy)" + "print(\"Numpy accuracy:\", accuracy)" ] }, { @@ -227,7 +229,7 @@ "\n", "@numba.jit(nopython=True)\n", "def euclidean_distance_numba(vector1, vector2):\n", - " dist = np.linalg.norm(vector1-vector2)\n", + " dist = np.linalg.norm(vector1 - vector2)\n", " return dist" ] }, @@ -249,15 +251,15 @@ " # 1. Prepare container for predictions\n", " predictions = np.zeros(X_test.shape[0])\n", " for x in np.arange(X_test.shape[0]):\n", - " \n", + "\n", " # 2. Calculate distances\n", " inputs = X_train.copy()\n", " distances = np.zeros((inputs.shape[0], 1))\n", " for i in np.arange(inputs.shape[0]):\n", " distances[i] = euclidean_distance_numba(inputs[i], X_test[x])\n", - " \n", + "\n", " labels = y_train.copy()\n", - " labels = labels.reshape((labels.shape[0],1))\n", + " labels = labels.reshape((labels.shape[0], 1))\n", "\n", " # add labels column\n", " inputs = np.hstack((inputs, labels))\n", @@ -265,7 +267,7 @@ " inputs = np.hstack((inputs, distances))\n", "\n", " # 3. Sort based on distance\n", - " inputs = inputs[inputs[:,3].argsort()]\n", + " inputs = inputs[inputs[:, 3].argsort()]\n", " # 4. Choose k if the first elements\n", " # 2nd columns contains classes, select first k values\n", " neighbor_classes = inputs[:, 2][:k]\n", @@ -278,7 +280,7 @@ " else:\n", " counter[item] = 1\n", " counter_sorted = sorted(counter)\n", - " \n", + "\n", " # 6. 
Choose most common label\n", " predictions[x] = counter_sorted[0]\n", " return predictions" @@ -303,7 +305,7 @@ "predictions = knn_numba(X_train.values, y_train.values, X_test.values, 3)\n", "true_values = y_test.to_numpy()\n", "accuracy = np.mean(predictions == true_values)\n", - "print('Numba accuracy:', accuracy)" + "print(\"Numba accuracy:\", accuracy)" ] }, { @@ -314,7 +316,7 @@ "\n", "Numba_dpex implementation use `numba_dpex.kernel()` decorator. For more information about programming, SYCL kernels go to: https://intelpython.github.io/numba-dpex/latest/user_guides/kernel_programming_guide/index.html.\n", "\n", - "Calculating distance is like in the NumPy example. We are using Euclidean distance. Later, we create the queue of the neighbours by the calculated distance and count in provided *k* votes for dedicated classes of neighbours.\n", + "Calculating distance is like in the NumPy example. We are using Euclidean distance. Later, we create the queue of the neighbors by the calculated distance and count in provided *k* votes for dedicated classes of neighbors.\n", "\n", "In the end, we are taking a class that achieves the maximum value of votes and setting it for the current global iteration." 
] @@ -328,22 +330,30 @@ "import numba_dpex\n", "\n", "@numba_dpex.kernel\n", - "def knn_numba_dpex(train, train_labels, test, k, predictions, votes_to_classes_lst):\n", + "def knn_numba_dpex(\n", + " train,\n", + " train_labels,\n", + " test,\n", + " k,\n", + " predictions,\n", + " votes_to_classes_lst,\n", + "):\n", + " dtype = train.dtype\n", " i = numba_dpex.get_global_id(0)\n", - " queue_neighbors = numba_dpex.private.array(shape=(3, 2), dtype=np.float64)\n", - " \n", + " queue_neighbors = numba_dpex.private.array(shape=(3, 2), dtype=dtype)\n", + "\n", " for j in range(k):\n", - " x1 = train[j][0]\n", - " x2 = test[i][0]\n", + " x1 = train[j, 0]\n", + " x2 = test[i, 0]\n", "\n", - " distance = 0.0\n", + " distance = dtype.type(0.0)\n", " diff = x1 - x2\n", " distance += diff * diff\n", " dist = math.sqrt(distance)\n", "\n", " queue_neighbors[j, 0] = dist\n", " queue_neighbors[j, 1] = train_labels[j]\n", - " \n", + "\n", " for j in range(k):\n", " new_distance = queue_neighbors[j, 0]\n", " new_neighbor_label = queue_neighbors[j, 1]\n", @@ -359,17 +369,17 @@ " queue_neighbors[index, 1] = new_neighbor_label\n", "\n", " for j in range(k, len(train)):\n", - " x1 = train[j][0]\n", - " x2 = test[i][0]\n", + " x1 = train[j, 0]\n", + " x2 = test[i, 0]\n", "\n", - " distance = 0.0\n", + " distance = dtype.type(0.0)\n", " diff = x1 - x2\n", " distance += diff * diff\n", " dist = math.sqrt(distance)\n", - " \n", - " if dist < queue_neighbors[k - 1][0]:\n", - " queue_neighbors[k - 1][0] = dist\n", - " queue_neighbors[k - 1][1] = train_labels[j]\n", + "\n", + " if dist < queue_neighbors[k - 1, 0]:\n", + " queue_neighbors[k - 1, 0] = dist\n", + " queue_neighbors[k - 1, 1] = train_labels[j]\n", " new_distance = queue_neighbors[k - 1, 0]\n", " new_neighbor_label = queue_neighbors[k - 1, 1]\n", " index = k - 1\n", @@ -389,7 +399,7 @@ " votes_to_classes[int(queue_neighbors[j, 1])] += 1\n", "\n", " max_ind = 0\n", - " max_value = 0\n", + " max_value = dtype.type(0)\n", "\n", " for 
j in range(3):\n", " if votes_to_classes[j] > max_value:\n", @@ -407,7 +417,7 @@ "\n", "In this case, we will need to provide the container for predictions: `predictions` and the container for votes per class: `votes_to_classes_lst` (the container size is 3, as we have 3 classes in our dataset).\n", "\n", - "We are running a prepared k-NN function using `dctl.device_context()`, which allows us to select a device. For more information, go to: https://intelpython.github.io/dpctl/latest/docfiles/user_guides/manual/dpctl/device_selection.html." + "We are running a prepared k-NN function on a CPU device as the input data was allocated on the CPU. Numba-dpex will infer the execution queue based on where the input arguments to the kernel were allocated. Refer: https://intelpython.github.io/oneAPI-for-SciPy/details/programming_model/#compute-follows-data" ] }, { @@ -416,18 +426,19 @@ "metadata": {}, "outputs": [], "source": [ - "import dpctl\n", + "import dpnp\n", "\n", - "predictions = dpctl.tensor.empty(len(X_test.values))\n", + "predictions = dpnp.empty(len(X_test.values), device=\"cpu\")\n", "# we have 3 classes\n", - "votes_to_classes_lst = dpctl.tensor.zeros((len(X_test.values), 3))\n", + "votes_to_classes_lst = dpnp.zeros((len(X_test.values), 3), device=\"cpu\")\n", "\n", - "X_train_dpt = dpctl.tensor.asarray(X_train.values)\n", - "y_train_dpt = dpctl.tensor.asarray(y_train.values)\n", - "X_test_dpt = dpctl.tensor.asarray(X_test.values)\n", + "X_train_dpt = dpnp.asarray(X_train.values, device=\"cpu\")\n", + "y_train_dpt = dpnp.asarray(y_train.values, device=\"cpu\")\n", + "X_test_dpt = dpnp.asarray(X_test.values, device=\"cpu\")\n", "\n", - "with dpctl.device_context(\"opencl:cpu:0\"):\n", - " knn_numba_dpex[numba_dpex.Range(len(X_test.values))](X_train_dpt, y_train_dpt, X_test_dpt, 3, predictions, votes_to_classes_lst)" + "knn_numba_dpex[numba_dpex.Range(len(X_test.values))](\n", + " X_train_dpt, y_train_dpt, X_test_dpt, 3, predictions, votes_to_classes_lst\n", + 
")" ] }, { @@ -443,10 +454,10 @@ "metadata": {}, "outputs": [], "source": [ - "predictions_numba = dpctl.tensor.to_numpy(predictions)\n", + "predictions_numba = dpnp.asnumpy(predictions)\n", "true_values = y_test.to_numpy()\n", "accuracy = np.mean(predictions_numba == true_values)\n", - "print('Numba_dpex accuracy:', accuracy)" + "print(\"Numba_dpex accuracy:\", accuracy)" ] }, { diff --git a/AI-and-Analytics/Features-and-Functionality/IntelPython_Numpy_Numba_dpex_kNN/IntelPython_Numpy_Numba_dpex_kNN.py b/AI-and-Analytics/Features-and-Functionality/IntelPython_Numpy_Numba_dpex_kNN/IntelPython_Numpy_Numba_dpex_kNN.py index e3411f8d5b..5361129083 100644 --- a/AI-and-Analytics/Features-and-Functionality/IntelPython_Numpy_Numba_dpex_kNN/IntelPython_Numpy_Numba_dpex_kNN.py +++ b/AI-and-Analytics/Features-and-Functionality/IntelPython_Numpy_Numba_dpex_kNN/IntelPython_Numpy_Numba_dpex_kNN.py @@ -6,16 +6,16 @@ # ============================================================= # Copyright © 2022 Intel Corporation -# +# # SPDX-License-Identifier: MIT # ============================================================= # # Simple k-NN classification with numba_dpex IDP optimization -# -# This sample shows how to recieve the same accuracy of the k-NN model classification by using numpy, numba and numba_dpex. The computetaion are performed using wine dataset. -# -# Let's start with general inports used in the whole sample. +# +# This sample shows how to achieve the same accuracy of k-NN model classification using numpy, numba and numba_dpex. The computations are performed on the wine dataset. +# +# Let's start with the general imports used in the whole sample. # In[ ]: @@ -27,11 +27,11 @@ # ## Data preparation -# +# # Then, let's download the dataset and prepare it for future computations. -# +# # We are using the wine dataset available in the sci-kit learn library. For our purposes, we will be using only 2 features: alcohol and malic_acid. 
-# +# # So first we need to load the dataset and create DataFrame from it. Later we will limit the DataFrame to just target and 2 classes we choose for this problem. # In[ ]: @@ -40,12 +40,12 @@ from sklearn.datasets import load_wine data = load_wine() -# Convert loaded dataset to DataFrame +# Convert loaded dataset to DataFrame df = pd.DataFrame(data=data.data, columns=data.feature_names) -df['target'] = pd.Series(data.target) +df["target"] = pd.Series(data.target) -# Limit features to 2 selected for this problem -df = df[['target', 'alcohol', 'malic_acid']] +# Limit features to 2 selected for this problem +df = df[["target", "alcohol", "malic_acid"]] # Show top 5 values from the limited dataset df.head() @@ -59,45 +59,47 @@ np.random.seed(42) -# The next step is to prepare the dataset for training and testing. To do this, we randomly divided the downloaded wine dataset into a training set (containing 90% of the data) and a test set (containing 10% of the data). -# +# The next step is to prepare the dataset for training and testing. To do this, we randomly divided the downloaded wine dataset into a training set (containing 90% of the data) and a test set (containing 10% of the data). +# # In addition, we take from both sets (training and test) data *X* (features) and label *y* (target). 
# In[ ]: # we are using 10% of the data for the testing purpose -train_sample_idx = np.random.choice(df.index, size=int(df.shape[0]*0.9), replace=False) +train_sample_idx = np.random.choice( +    df.index, size=int(df.shape[0] * 0.9), replace=False +) train_data, test_data = df.iloc[train_sample_idx], df.drop(train_sample_idx) # get features and label from train/test data -X_train, y_train = train_data.drop('target', axis=1), train_data['target'] -X_test, y_test = test_data.drop('target', axis=1), test_data['target'] +X_train, y_train = train_data.drop("target", axis=1), train_data["target"] +X_test, y_test = test_data.drop("target", axis=1), test_data["target"] # ## NumPy k-NN -# -# Now, it's time to implenet the first version of k-NN function using NumPy. -# -# First, let's create simple euqlidesian distance function. We are taking positions form the provided vectors, counting the squares of the individual differences between the positions, and then drawing the root of their sum for the whole vectors (remember that the vectors must be of equal length). +# +# Now, it's time to implement the first version of the k-NN function using NumPy. +# +# First, let's create a simple Euclidean distance function. We take the corresponding positions from the provided vectors, square the differences between them, and then take the square root of their sum over the whole vectors (remember that the vectors must be of equal length). # In[ ]: def distance(vector1, vector2): -    dist = [(a - b)**2 for a, b in zip(vector1, vector2)] +    dist = [(a - b) ** 2 for a, b in zip(vector1, vector2)]     dist = math.sqrt(sum(dist))     return dist -# Then, the k-nearest neighbour algorithm itself. -# +# Then, the k-nearest neighbors algorithm itself. +# # 1. We are starting by defining a container for predictions the same size as a test set. # 2. Then, for each row in the test set, we calculate distances between then and every training record. # 3. 
We are sorting training datasets based on calculated distances # 4. Choose k of the first elements in the sorted training list. # 5. We are counting labels appearances -# 6. The most common label is set as a prediction. +# 6. The most common label is set as a prediction. # In[ ]: @@ -106,26 +108,26 @@ def knn(X_train, y_train, X_test, k): # 1. Prepare container for predictions predictions = np.zeros(X_test.shape[0]) X_test.reset_index(drop=True, inplace=True) - + for index, row in X_test.iterrows(): # 2. Calculate distances inputs = X_train.copy() - inputs['distance'] = inputs.apply(distance, vector2=row, axis=1) + inputs["distance"] = inputs.apply(distance, vector2=row, axis=1) inputs = pd.concat([inputs, y_train], axis=1) - + # 3. Sort based on distance - inputs = inputs.sort_values('distance', ascending=True) - + inputs = inputs.sort_values("distance", ascending=True) + # 4. Choose k if the first elements neighbors = inputs.head(k) - classes = neighbors['target'].tolist() - + classes = neighbors["target"].tolist() + # 5. Count labels appearances majority_count = Counter(classes) - + # 6. Choose most common label predictions[index] = majority_count.most_common(1).pop()[0] - + return predictions @@ -139,15 +141,15 @@ def knn(X_train, y_train, X_test, k): predictions = knn(X_train, y_train, X_test, 3) true_values = y_test.to_numpy() accuracy = np.mean(predictions == true_values) -print('Numpy accuracy:', accuracy) +print("Numpy accuracy:", accuracy) # ## Numba k-NN -# +# # Now, let's move to the numba implementation of the k-NN algorithm. We will start the same, by defining the distance function and importing the necessary packages. -# +# # For numba implementation, we are using the core functionality which is `numba.jit()` decorator. -# +# # We are starting with defining the distance function. Like before it is a euclidean distance. For additional optimization we are using `np.linalg.norm`. 
# In[ ]: @@ -157,7 +159,7 @@ def knn(X_train, y_train, X_test, k): @numba.jit(nopython=True) def euclidean_distance_numba(vector1, vector2): - dist = np.linalg.norm(vector1-vector2) + dist = np.linalg.norm(vector1 - vector2) return dist @@ -171,15 +173,15 @@ def knn_numba(X_train, y_train, X_test, k): # 1. Prepare container for predictions predictions = np.zeros(X_test.shape[0]) for x in np.arange(X_test.shape[0]): - + # 2. Calculate distances inputs = X_train.copy() distances = np.zeros((inputs.shape[0], 1)) for i in np.arange(inputs.shape[0]): distances[i] = euclidean_distance_numba(inputs[i], X_test[x]) - + labels = y_train.copy() - labels = labels.reshape((labels.shape[0],1)) + labels = labels.reshape((labels.shape[0], 1)) # add labels column inputs = np.hstack((inputs, labels)) @@ -187,7 +189,7 @@ def knn_numba(X_train, y_train, X_test, k): inputs = np.hstack((inputs, distances)) # 3. Sort based on distance - inputs = inputs[inputs[:,3].argsort()] + inputs = inputs[inputs[:, 3].argsort()] # 4. Choose k if the first elements # 2nd columns contains classes, select first k values neighbor_classes = inputs[:, 2][:k] @@ -200,14 +202,14 @@ def knn_numba(X_train, y_train, X_test, k): else: counter[item] = 1 counter_sorted = sorted(counter) - + # 6. Choose most common label predictions[x] = counter_sorted[0] return predictions -# Similarly, as in the NumPy example, we are testing implemented method for the `k = 3`. -# +# Similarly, as in the NumPy example, we are testing implemented method for the `k = 3`. +# # The accuracy of the method is the same as in the NumPy implementation. # In[ ]: @@ -217,15 +219,15 @@ def knn_numba(X_train, y_train, X_test, k): predictions = knn_numba(X_train.values, y_train.values, X_test.values, 3) true_values = y_test.to_numpy() accuracy = np.mean(predictions == true_values) -print('Numba accuracy:', accuracy) +print("Numba accuracy:", accuracy) # ## Numba_dpex k-NN -# +# # Numba_dpex implementation use `numba_dpex.kernel()` decorator. 
For more information about programming, SYCL kernels go to: https://intelpython.github.io/numba-dpex/latest/user_guides/kernel_programming_guide/index.html. -# -# Calculating distance is like in the NumPy example. We are using Euclidean distance. Later, we create the queue of the neighbours by the calculated distance and count in provided *k* votes for dedicated classes of neighbours. -# +# +# Calculating distance is like in the NumPy example. We are using Euclidean distance. Later, we create the queue of the neighbors by the calculated distance and count in provided *k* votes for dedicated classes of neighbors. +# # In the end, we are taking a class that achieves the maximum value of votes and setting it for the current global iteration. # In[ ]: @@ -234,22 +236,30 @@ def knn_numba(X_train, y_train, X_test, k): import numba_dpex @numba_dpex.kernel -def knn_numba_dpex(train, train_labels, test, k, predictions, votes_to_classes_lst): +def knn_numba_dpex( + train, + train_labels, + test, + k, + predictions, + votes_to_classes_lst, +): + dtype = train.dtype i = numba_dpex.get_global_id(0) - queue_neighbors = numba_dpex.private.array(shape=(3, 2), dtype=np.float64) - + queue_neighbors = numba_dpex.private.array(shape=(3, 2), dtype=dtype) + for j in range(k): - x1 = train[j][0] - x2 = test[i][0] + x1 = train[j, 0] + x2 = test[i, 0] - distance = 0.0 + distance = dtype.type(0.0) diff = x1 - x2 distance += diff * diff dist = math.sqrt(distance) queue_neighbors[j, 0] = dist queue_neighbors[j, 1] = train_labels[j] - + for j in range(k): new_distance = queue_neighbors[j, 0] new_neighbor_label = queue_neighbors[j, 1] @@ -265,17 +275,17 @@ def knn_numba_dpex(train, train_labels, test, k, predictions, votes_to_classes_l queue_neighbors[index, 1] = new_neighbor_label for j in range(k, len(train)): - x1 = train[j][0] - x2 = test[i][0] + x1 = train[j, 0] + x2 = test[i, 0] - distance = 0.0 + distance = dtype.type(0.0) diff = x1 - x2 distance += diff * diff dist = math.sqrt(distance) 
- - if dist < queue_neighbors[k - 1][0]: - queue_neighbors[k - 1][0] = dist - queue_neighbors[k - 1][1] = train_labels[j] + + if dist < queue_neighbors[k - 1, 0]: + queue_neighbors[k - 1, 0] = dist + queue_neighbors[k - 1, 1] = train_labels[j] new_distance = queue_neighbors[k - 1, 0] new_neighbor_label = queue_neighbors[k - 1, 1] index = k - 1 @@ -295,7 +305,7 @@ def knn_numba_dpex(train, train_labels, test, k, predictions, votes_to_classes_l votes_to_classes[int(queue_neighbors[j, 1])] += 1 max_ind = 0 - max_value = 0 + max_value = dtype.type(0) for j in range(3): if votes_to_classes[j] > max_value: @@ -306,39 +316,26 @@ def knn_numba_dpex(train, train_labels, test, k, predictions, votes_to_classes_l # Next, like before, let's test the prepared k-NN function. -# +# # In this case, we will need to provide the container for predictions: `predictions` and the container for votes per class: `votes_to_classes_lst` (the container size is 3, as we have 3 classes in our dataset). -# -# We are running a prepared k-NN function using `dctl.device_context()`, which allows us to select a device. For more information, go to: https://intelpython.github.io/dpctl/latest/docfiles/user_guides/manual/dpctl/device_selection.html. - +# +# We are running a prepared k-NN function on a CPU device as the input data was allocated on the CPU. Numba-dpex will infer the execution queue based on where the input arguments to the kernel were allocated. 
Refer: https://intelpython.github.io/oneAPI-for-SciPy/details/programming_model/#compute-follows-data # In[ ]: -import dpctl +import dpnp -predictions = dpctl.tensor.empty(len(X_test.values)) +predictions = dpnp.empty(len(X_test.values), device="cpu") # we have 3 classes -votes_to_classes_lst = dpctl.tensor.zeros((len(X_test.values), 3)) +votes_to_classes_lst = dpnp.zeros((len(X_test.values), 3), device="cpu") -X_train_dpt = dpctl.tensor.asarray(X_train.values) -y_train_dpt = dpctl.tensor.asarray(y_train.values) -X_test_dpt = dpctl.tensor.asarray(X_test.values) +X_train_dpt = dpnp.asarray(X_train.values, device="cpu") +y_train_dpt = dpnp.asarray(y_train.values, device="cpu") +X_test_dpt = dpnp.asarray(X_test.values, device="cpu") -d = dpctl.SyclDevice("gpu") -if (d.has_aspect_fp64 == False): - print("Double precision floating points not supported on this Device. Exiting!\n") - -else: - with dpctl.device_context(d): - knn_numba_dpex[numba_dpex.Range(len(X_test.values))](X_train_dpt, y_train_dpt, X_test_dpt, 3, predictions, votes_to_classes_lst) - predictions_numba = dpctl.tensor.to_numpy(predictions) - true_values = y_test.to_numpy() - accuracy = np.mean(predictions_numba == true_values) - print('Numba_dpex accuracy:', accuracy) - print("[CODE_SAMPLE_COMPLETED_SUCCESFULLY]") - -#with dpctl.device_context("opencl:cpu:0"): - #knn_numba_dpex[numba_dpex.Range(len(X_test.values))](X_train_dpt, y_train_dpt, X_test_dpt, 3, predictions, votes_to_classes_lst) +knn_numba_dpex[numba_dpex.Range(len(X_test.values))]( + X_train_dpt, y_train_dpt, X_test_dpt, 3, predictions, votes_to_classes_lst +) # Like before, let's measure the accuracy of the prepared implementation. It is measured as the number of well-assigned classes for the test set. The final result is the same for all: NumPy, numba and numba-dpex implementations. 
@@ -346,11 +343,13 @@ def knn_numba_dpex(train, train_labels, test, k, predictions, votes_to_classes_l # In[ ]: - +predictions_numba = dpnp.asnumpy(predictions) +true_values = y_test.to_numpy() +accuracy = np.mean(predictions_numba == true_values) +print("Numba_dpex accuracy:", accuracy) # In[ ]: -#print("[CODE_SAMPLE_COMPLETED_SUCCESFULLY]") - +print("[CODE_SAMPLE_COMPLETED_SUCCESSFULLY]") \ No newline at end of file diff --git a/AI-and-Analytics/Features-and-Functionality/IntelPython_daal4py_DistributedKMeans/sample.json b/AI-and-Analytics/Features-and-Functionality/IntelPython_daal4py_DistributedKMeans/sample.json index e6e52de105..07e667ff25 100755 --- a/AI-and-Analytics/Features-and-Functionality/IntelPython_daal4py_DistributedKMeans/sample.json +++ b/AI-and-Analytics/Features-and-Functionality/IntelPython_daal4py_DistributedKMeans/sample.json @@ -14,15 +14,8 @@ "linux": [{ "env": [ "source /intel/oneapi/intelpython/bin/activate", - "source activate base", - "pip install -r requirements.txt", - "wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB", - "apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB", - "rm GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB", - "echo \"deb https://apt.repos.intel.com/oneapi all main\" | tee /etc/apt/sources.list.d/oneAPI.list", - "apt update", - "apt -y install intel-hpckit", - "source /opt/intel/oneapi/setvars.sh" + "conda activate base", + "pip install -r requirements.txt" ], "id": "idp_d4p_KM_Dist", "steps": [ diff --git a/AI-and-Analytics/Features-and-Functionality/IntelPython_daal4py_DistributedLinearRegression/sample.json b/AI-and-Analytics/Features-and-Functionality/IntelPython_daal4py_DistributedLinearRegression/sample.json index bb800f355f..62b036815b 100755 --- a/AI-and-Analytics/Features-and-Functionality/IntelPython_daal4py_DistributedLinearRegression/sample.json +++ b/AI-and-Analytics/Features-and-Functionality/IntelPython_daal4py_DistributedLinearRegression/sample.json @@ -14,14 +14,7 @@ "linux": [{ 
"env": [ "source /intel/oneapi/intelpython/bin/activate", - "pip install -r requirements.txt", - "wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB", - "apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB", - "rm GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB", - "echo \"deb https://apt.repos.intel.com/oneapi all main\" | tee /etc/apt/sources.list.d/oneAPI.list", - "apt update", - "apt -y install intel-hpckit", - "source /opt/intel/oneapi/setvars.sh" + "pip install -r requirements.txt" ], "id": "idp_d4p_Linear_Regression_Dist", "steps": [ diff --git a/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_Enabling_Auto_Mixed_Precision_for_TransferLearning/README.md b/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_Enabling_Auto_Mixed_Precision_for_TransferLearning/README.md index ebc25d1d7d..9b484f0213 100644 --- a/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_Enabling_Auto_Mixed_Precision_for_TransferLearning/README.md +++ b/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_Enabling_Auto_Mixed_Precision_for_TransferLearning/README.md @@ -43,7 +43,7 @@ You will need to download and install the following toolkits, tools, and compone - **Other dependencies** - Install using PIP and the `requirements.txt` file supplied with the sample: `$pip install -r requirements.txt`.
The `requirements.txt` file contains the necessary dependencies to run the Notebook. + Install using PIP and the `requirements.txt` file supplied with the sample: `$pip install -r requirements.txt --no-deps`.
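Because `--no-deps` tells pip to skip dependency resolution, it can be worth verifying after installation that every package the notebook needs is actually present. A stdlib-only sketch of such a check (the package list mirrors this sample's `requirements.txt`; the function name is ours):

```python
from importlib import metadata

def missing_packages(names):
    """Return the subset of `names` that is not installed in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)  # raises if the distribution is absent
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing

# Packages listed in this sample's requirements.txt
print(missing_packages(["notebook", "Pillow", "tensorflow_hub", "requests"]))
```

An empty list means the `--no-deps` install left nothing missing; otherwise the printed names are the ones to install by hand.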
The `requirements.txt` file contains the necessary dependencies to run the Notebook. ### For Intel® DevCloud @@ -117,4 +117,4 @@ For performance analysis, you will see histograms showing different Tensorflow* Code samples are licensed under the MIT license. See [License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt) for details. -Third party program Licenses can be found here: [third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt). \ No newline at end of file +Third party program Licenses can be found here: [third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt). diff --git a/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_Enabling_Auto_Mixed_Precision_for_TransferLearning/enabling_automixed_precision_for_transfer_learning_with_tensorflow.ipynb b/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_Enabling_Auto_Mixed_Precision_for_TransferLearning/enabling_automixed_precision_for_transfer_learning_with_tensorflow.ipynb index 28634988a2..d923a2db0d 100644 --- a/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_Enabling_Auto_Mixed_Precision_for_TransferLearning/enabling_automixed_precision_for_transfer_learning_with_tensorflow.ipynb +++ b/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_Enabling_Auto_Mixed_Precision_for_TransferLearning/enabling_automixed_precision_for_transfer_learning_with_tensorflow.ipynb @@ -240,7 +240,7 @@ "\n", "if arch == 'SPR':\n", " # Create a deep copy of the model to train the bf16 model separately to compare accuracy\n", - " bf16_model = deepcopy(fp32_model)\n", + " bf16_model = tf.keras.models.clone_model(fp32_model)\n", "\n", "fp32_model.summary()" ] diff --git a/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_Enabling_Auto_Mixed_Precision_for_TransferLearning/requirements.txt 
b/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_Enabling_Auto_Mixed_Precision_for_TransferLearning/requirements.txt index 071d9c8801..d1622109b9 100644 --- a/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_Enabling_Auto_Mixed_Precision_for_TransferLearning/requirements.txt +++ b/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_Enabling_Auto_Mixed_Precision_for_TransferLearning/requirements.txt @@ -2,4 +2,3 @@ notebook Pillow tensorflow_hub requests - diff --git a/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_Enabling_Auto_Mixed_Precision_for_TransferLearning/sample.json b/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_Enabling_Auto_Mixed_Precision_for_TransferLearning/sample.json index feade98eec..c1bfce2009 100755 --- a/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_Enabling_Auto_Mixed_Precision_for_TransferLearning/sample.json +++ b/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_Enabling_Auto_Mixed_Precision_for_TransferLearning/sample.json @@ -12,9 +12,13 @@ "ciTests": { "linux": [{ "env": [ + "echo \"deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal\" | tee /etc/apt/sources.list.d/tensorflow-serving.list", + "curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | apt-key add -", + "apt-get update && apt-get install tensorflow-model-server", "source /intel/oneapi/intelpython/bin/activate", "conda activate tensorflow", - "pip install -r requirements.txt", + "pip install -r requirements.txt --no-deps", + "pip install tensorflow==2.15.0.post1", "pip install jupyter ipykernel", "python -m ipykernel install --user --name=tensorflow" ], diff --git a/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_Horovod_Distributed_Deep_Learning/sample.json 
b/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_Horovod_Distributed_Deep_Learning/sample.json index 7c9a7b3450..33ce8b7638 100644 --- a/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_Horovod_Distributed_Deep_Learning/sample.json +++ b/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_Horovod_Distributed_Deep_Learning/sample.json @@ -14,8 +14,7 @@ "env": [ "source /intel/oneapi/intelpython/bin/activate", "conda activate tensorflow-gpu", - "pip install intel-optimization-for-horovod", - "pip install jupyter ipykernel", + "pip install jupyter", "python -m ipykernel install --user --name=tensorflow-gpu" ], "id": "distributed_learning_tensorflow_horovod_py", diff --git a/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_InferenceOptimization/sample.json b/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_InferenceOptimization/sample.json index 6b51259a79..5e43dba12e 100755 --- a/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_InferenceOptimization/sample.json +++ b/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_InferenceOptimization/sample.json @@ -15,6 +15,7 @@ "steps": [ "source /intel/oneapi/intelpython/bin/activate", "conda activate tensorflow", + "pip install tensorflow==2.15.0.post1", "pip install ipykernel jupyter", "python -m ipykernel install --user --name=tensorflow", "jupyter nbconvert --to notebook --execute tutorial_optimize_TensorFlow_pretrained_model.ipynb" diff --git a/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_InferenceOptimization/scripts/profile_utils.py b/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_InferenceOptimization/scripts/profile_utils.py index 5347302738..7cb52c86b7 100644 --- a/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_InferenceOptimization/scripts/profile_utils.py +++ b/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_InferenceOptimization/scripts/profile_utils.py @@ -6,7 +6,6 @@ import os, fnmatch 
import psutil import ast -import tensorflow.estimator from tensorflow.python.training import training_util try: from git import Repo @@ -64,49 +63,6 @@ def save_timeline(self, fname): self.many_runs_timeline.save(fname) -class tfProfileHook(tf.estimator.ProfilerHook): - def __init__(self, save_steps=None, save_secs=None, output_dir="", json_fname="", timeline_count=10): - self._output_tag = "blah-{}" - self._output_dir = output_dir - self._timer = tf.estimator.SecondOrStepTimer(every_secs=save_secs, - every_steps=save_steps) - self._atomic_counter = 0 - self.many_runs_timeline = TimeLiner() - self.timeline_count = timeline_count - import os - ProfileUtilsRoot = os.environ['ProfileUtilsRoot'] - self.json_fname = ProfileUtilsRoot + "/../" + json_fname - if output_dir == "": - output_dir = ProfileUtilsRoot + "/../" - - def begin(self): - self._next_step = None - self._global_step_tensor = training_util.get_global_step() - - if self._global_step_tensor is None: - raise RuntimeError("Global step should be created to use ProfilerHook.") - - def before_run(self, run_context): - self._request_summary = (self._next_step is None or self._timer.should_trigger_for_step(self._next_step)) - requests = {} - opts = tf.compat.v1.RunOptions(trace_level=tf.compat.v1.RunOptions.FULL_TRACE) - return tf.estimator.SessionRunArgs(requests, options=opts) - - def after_run(self, run_context, run_values): - - global_step = self._atomic_counter + 1 - self._atomic_counter = self._atomic_counter + 1 - self._next_step = global_step + 1 - - self.many_runs_timeline.update_timeline_from_runmeta(run_values.run_metadata) - if self._atomic_counter == self.timeline_count: - self.many_runs_timeline.save(self.json_fname) - - def end(self, session): - if self._atomic_counter < self.timeline_count: - self.many_runs_timeline.save(self.json_fname) - - class TensorflowUtils: def is_mkl_enabled(self): diff --git a/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_TextGeneration_with_LSTM/sample.json 
b/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_TextGeneration_with_LSTM/sample.json index 245e528fb8..bc6b668f72 100644 --- a/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_TextGeneration_with_LSTM/sample.json +++ b/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_TextGeneration_with_LSTM/sample.json @@ -18,10 +18,11 @@ "conda create --name user_tensorflow-gpu --clone tensorflow-gpu", "conda activate user_tensorflow-gpu", "pip install -r requirements.txt", - "~/.conda/envs/user_tensorflow-gpu/bin/python -m ipykernel install --user --name=user_tensorflow-gpu" + "python -m ipykernel install --user --name=user_tensorflow-gpu" ], "id": "inc_text_generation_lstm_py", "steps": [ + "export ITEX_ENABLE_NEXTPLUGGABLE_DEVICE=0", "python TextGenerationModelTraining.py" ] } diff --git a/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_for_LLMs/requirements.txt b/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_for_LLMs/requirements.txt index a3b45cd16b..a912fb36c8 100644 --- a/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_for_LLMs/requirements.txt +++ b/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_for_LLMs/requirements.txt @@ -13,6 +13,6 @@ tensorboard-data-server tensorboard-plugin-wit tf-estimator-nightly tokenizers -transformers==4.30.0 +transformers evaluate requests diff --git a/AI-and-Analytics/Getting-Started-Samples/INC-Quantization-Sample-for-PyTorch/README.MD b/AI-and-Analytics/Getting-Started-Samples/INC-Quantization-Sample-for-PyTorch/README.MD index 5056b72dba..7920902ad1 100644 --- a/AI-and-Analytics/Getting-Started-Samples/INC-Quantization-Sample-for-PyTorch/README.MD +++ b/AI-and-Analytics/Getting-Started-Samples/INC-Quantization-Sample-for-PyTorch/README.MD @@ -25,10 +25,9 @@ The sample starts by loading a BERT model from Hugging Face. After loading the m > **Note**: AI and Analytics samples are validated on AI Tools Offline Installer. 
For the full list of validated platforms refer to [Platform Validation](https://github.com/oneapi-src/oneAPI-samples/tree/master?tab=readme-ov-file#platform-validation). ## Key Implementation Details +The sample contains one Jupyter Notebook and one Python script. It can be run using Jupyter notebooks or the offline installer. -The sample contains one Jupyter Notebook and one Python script. - -### Jupyter Notebook +### Run using the Jupyter Notebook or the Python script |Notebook |Description |:--- |:--- @@ -40,14 +39,14 @@ The sample contains one Jupyter Notebook and one Python script. |:--- |:--- |`dataset.py` | The script provides a PyTorch* Dataset class that tokenizes text data -## Environment Setup +### Set up your environment for the offline installer You will need to download and install the following toolkits, tools, and components to use the sample. **1. Get Intel® AI Tools** Required AI Tools: **Intel® Neural Compressor, Intel® Extension of PyTorch***. -
Select and install needed AI Tools via [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html). AI and Analytics samples are validated on AI Tools Offline Installer. It is recommended to select Offline Installer option in AI Tools Selector. +
If you have not already, select and install these Tools via [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html). AI and Analytics samples are validated on AI Tools Offline Installer. It is recommended to select Offline Installer option in AI Tools Selector. **2. Install dependencies** ``` @@ -65,7 +64,15 @@ Go to the section which corresponds to the installation method chosen in [AI Too * [Docker](#docker) ### AI Tools Offline Installer (Validated) -1. If you have not already done so, complete [Configure you system](https://www.intel.com/content/www/us/en/docs/oneapi-ai-analytics-toolkit/get-started-guide-linux/2024-0/before-you-begin.html) section from AI Tools Get Started Guide. +1. If you have not already done so, activate the AI Tools bundle base environment. +If you used the default location to install AI Tools, open a terminal and type the following +``` +source $HOME/intel/oneapi/intelpython/bin/activate +``` +If you used a separate location, open a terminal and type the following +``` +source /bin/activate +``` 2. Activate the Conda environment: ``` conda activate pytorch @@ -73,7 +80,7 @@ conda activate pytorch 3. Clone the GitHub repository: ``` git clone https://github.com/oneapi-src/oneAPI-samples.git -cd oneapi-samples/AI-and-Analytics/Getting-Started-Samples +cd oneAPI-samples/AI-and-Analytics/Getting-Started-Samples ``` 4. Launch Jupyter Notebook: > **Note**: You might need to register Conda kernel to Jupyter Notebook kernel, @@ -94,7 +101,7 @@ optimize_pytorch_models_with_ipex.ipynb 1. Clone the GitHub repository: ``` git clone https://github.com/oneapi-src/oneAPI-samples.git -cd oneapi-samples/AI-and-Analytics/Getting-Started-Samples +cd oneAPI-samples/AI-and-Analytics/Getting-Started-Samples ``` 2. 
Launch Jupyter Notebook: > **Note**: You might need to register Conda kernel to Jupyter Notebook kernel, diff --git a/AI-and-Analytics/Getting-Started-Samples/INC-Sample-for-Tensorflow/README.md b/AI-and-Analytics/Getting-Started-Samples/INC-Sample-for-Tensorflow/README.md index b38890e511..a488c45315 100644 --- a/AI-and-Analytics/Getting-Started-Samples/INC-Sample-for-Tensorflow/README.md +++ b/AI-and-Analytics/Getting-Started-Samples/INC-Sample-for-Tensorflow/README.md @@ -1,20 +1,21 @@ -# `Intel® Neural Compressor (INC) TensorFlow* Getting Started*` Sample +# `Intel® Neural Compressor TensorFlow* Getting Started*` Sample -The `Intel® Neural Compressor (INC) TensorFlow* Getting Started*` Sample demonstrates using the Intel® Neural Compressor (INC), which is part of the Intel® AI Tools with the with Intel® Optimizations for TensorFlow* to speed up inference by simplifying the process of converting the FP32 model to INT8/BF16. +This sample demonstrates using the Intel® Neural Compressor, which is part of the Intel® AI Tools with the with Intel® Optimizations for TensorFlow* to speed up inference by simplifying the process of converting the FP32 model to INT8/BF16. -| Area | Description +| Property | Description |:--- |:--- -| What you will learn | How to use Intel® Neural Compressor (INC) tool to quantize the AI model based on TensorFlow* and speed up the inference on Intel® Xeon® CPUs -| Time to complete | 10 minutes | Category | Getting Started +| What you will learn | How to use Intel® Neural Compressor tool to quantize the AI model based on TensorFlow* and speed up the inference on Intel® Xeon® CPUs +| Time to complete | 10 minutes + ## Purpose -This sample shows the process of building a convolutional neural network (CNN) model to recognize handwritten numbers and demonstrates how to increase the inference performance by using Intel® Neural Compressor (INC). Low-precision optimizations can speed up inference. 
Intel® Neural Compressor (INC) simplifies the process of converting the FP32 model to INT8/BF16. At the same time, Intel® Neural Compressor (INC) tunes the quantization method to reduce the accuracy loss, which is a big blocker for low-precision inference. +This sample shows the process of building a convolutional neural network (CNN) model to recognize handwritten numbers and demonstrates how to increase the inference performance by using Intel® Neural Compressor. Low-precision optimizations can speed up inference. Intel® Neural Compressor simplifies the process of converting the FP32 model to INT8/BF16. At the same time, Intel® Neural Compressor tunes the quantization method to reduce the accuracy loss, which is a big blocker for low-precision inference. You can achieve higher inference performance by converting the FP32 model to INT8 or BF16 model. Additionally, Intel® Deep Learning Boost (Intel® DL Boost) in Intel® Xeon® Scalable processors and Xeon® processors provides hardware acceleration for INT8 and BF16 models. -You will learn how to train a CNN model with Keras and TensorFlow*, use Intel® Neural Compressor (INC) to quantize the model, and compare the performance to see the benefit of Intel® Neural Compressor (INC). +You will learn how to train a CNN model with Keras and TensorFlow*, use Intel® Neural Compressor to quantize the model, and compare the performance to see the benefit of Intel® Neural Compressor. ## Prerequisites @@ -22,13 +23,13 @@ You will learn how to train a CNN model with Keras and TensorFlow*, use Intel® |:--- |:--- | OS | Ubuntu* 20.04 (or newer)
Windows 11, 10* | Hardware | Intel® Core™ Gen10 Processor
Intel® Xeon® Scalable Performance processors -| Software | Intel® Neural Compressor (INC), Intel Optimization for TensorFlow +| Software | Intel® Neural Compressor, Intel Optimization for TensorFlow -### Intel® Neural Compressor (INC) and Sample Code Versions +### Intel® Neural Compressor and Sample Code Versions ->**Note**: See the [Intel® Neural Compressor (INC)](https://github.com/intel/neural-compressor) GitHub repository for more information and recent changes. +>**Note**: See the [Intel® Neural Compressor](https://github.com/intel/neural-compressor) GitHub repository for more information and recent changes. -This sample is updated regularly to match the Intel® Neural Compressor (INC) version in the latest Intel® AI Tools release. If you want to get the sample code for an earlier toolkit release, check out the corresponding git tag. +This sample is updated regularly to match the Intel® Neural Compressor version in the latest Intel® AI Tools release. If you want to get the sample code for an earlier toolkit release, check out the corresponding git tag. 1. List the available git tags. ``` @@ -63,25 +64,25 @@ You will need to download and install the following toolkits, tools, and compone The sample demonstrates how to: - Use Keras from TensorFlow* to build and train a CNN model. -- Define a function and class for Intel® Neural Compressor (INC) to +- Define a function and class for Intel® Neural Compressor to quantize the CNN model. - - The Intel® Neural Compressor (INC) can run on any Intel® CPU to quantize the AI model. + - The Intel® Neural Compressor can run on any Intel® CPU to quantize the AI model. - The quantized AI model has better inference performance than the FP32 model on Intel CPUs. - Specifically, the latest Intel® Xeon® Scalable processors and Xeon® processors provide hardware acceleration for such tasks. - Test the performance of the FP32 model and INT8 (quantization) model. 
-## Prepare the Environment +## Environment Setup If you have already set up the PIP or Conda environment and installed AI Tools go directly to Run the Notebook. -### On Linux* (Only applicable to AI Tools Offline Installer) +### On Linux* -#### Set Environment Variables +#### Setup Conda Environment -When working with the command-line interface (CLI), you should configure the oneAPI toolkits using environment variables. Set up your CLI environment by sourcing the `setvars` script every time you open a new terminal window. This practice ensures that your compiler, libraries, and tools are ready for development. +You can list the available conda environments using a command similar to the following. -#### Activate Conda +##### Option 1: Clone Conda Environment from AI Toolkit Conda Environment -You can list the available conda environments using a command similar to the following +Confirm that the Intel® AI Toolkit is installed. ``` conda info -e @@ -115,25 +116,14 @@ tensorflow-2.3.0 /opt/intel/oneapi/intelpython/latest/envs/tensorflow-2. ``` source activate usr_tensorflow ``` -2. Install Intel® Neural Compressor (INC) from the local channel. +2. Install Intel® Neural Compressor from the local channel. ``` conda install -c ${ONEAPI_ROOT}/conda_channel neural-compressor -y --offline ``` -#### Configure Jupyter Notebook +##### Option 2: Create Conda Environment -1. Create a new kernel for the Jupyter notebook based on your activated conda environment. - ``` - conda install ipykernel - python -m ipykernel install --user --name usr_tensorflow - ``` - This step is optional if you plan to open the notebook on your local server. - -### On Windows* - -#### Configure Conda - -1. 
Configure Conda for **user_tensorflow** by entering commands similar to the following: +Configure Conda for **user_tensorflow** by entering commands similar to the following: ``` conda deactivate conda env remove -n user_tensorflow @@ -142,25 +132,42 @@ tensorflow-2.3.0 /opt/intel/oneapi/intelpython/latest/envs/tensorflow-2. conda install -n user_tensorflow pycocotools -c esri -y conda install -n user_tensorflow neural-compressor tensorflow -c conda-forge -c intel -y conda install -n user_tensorflow jupyter runipy notebook -y + conda install -c anaconda ipykernel + python -m ipykernel install --user --name=user_tensorflow + ``` + + +#### Configure Jupyter Notebook + +Create a new kernel for the Jupyter notebook based on your activated conda environment. ``` + conda install ipykernel + python -m ipykernel install --user --name usr_tensorflow + ``` + This step is optional if you plan to open the notebook on your local server. -## Run the `Intel® Neural Compressor (INC) TensorFlow* Getting Started*` Sample +## Run the `Intel® Neural Compressor TensorFlow* Getting Started*` Sample -> **Note**: If you have not already done so, set up your CLI -> environment by sourcing the `setvars` script in the root of your oneAPI installation. +> **Note**: Before running the sample, make sure [Environment Setup](https://github.com/oneapi-src/oneAPI-samples/tree/master/AI-and-Analytics/Getting-Started-Samples/INC-Sample-for-TensorFlow#environment-setup) is completed. > > Linux*: -> - For system wide installations: `. /opt/intel/oneapi/setvars.sh` -> - For private installations: ` . 
~/intel/oneapi/setvars.sh` -> - For non-POSIX shells, like csh, use the following command: `bash -c 'source /setvars.sh ; exec csh'` -> -> Windows*: -> - `C:\Program Files (x86)\Intel\oneAPI\setvars.bat` -> - Windows PowerShell*, use the following command: `cmd.exe "/K" '"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" && powershell'` -> -> For more information on configuring environment variables, see *[Use the setvars Script with Linux* or macOS*](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-development-environment-setup/use-the-setvars-script-with-linux-or-macos.html)* or *[Use the setvars Script with Windows*](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-development-environment-setup/use-the-setvars-script-with-windows.html)*. +Go to the section which corresponds to the installation method chosen in [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html) to see relevant instructions: +* [AI Tools Offline Installer (Validated)](#ai-tools-offline-installer-validated) +* [Conda/PIP](#condapip) +* [Docker](#docker) + +### AI Tools Offline Installer (Validated) +1. If you have not already done so, activate the AI Tools bundle base environment. +If you used the default location to install AI Tools, open a terminal and type the following +``` +source $HOME/intel/oneapi/intelpython/bin/activate +``` +If you used a separate location, open a terminal and type the following +``` +source /bin/activate +``` -### Steps for Intel AI Tools Offline Installer +### Activate Conda Environment 1. Ensure you activate the conda environment. ``` @@ -212,11 +219,34 @@ tensorflow-2.3.0 /opt/intel/oneapi/intelpython/latest/envs/tensorflow-2. ``` 4. Change the kernel to **user_tensorflow**. 5. Run every cell in the Notebook in sequence.
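The comparison the notebook prints expresses each INT8 metric relative to the FP32 baseline: ratios for throughput and latency, an absolute delta for accuracy. A minimal sketch of that arithmetic (function names are ours, not the notebook's):

```python
def relative_throughput(fp32, int8):
    """Throughput expressed relative to the FP32 baseline: [1, int8/fp32]."""
    return [1, int8 / fp32]

def accuracy_delta(fp32, int8):
    """Accuracy is compared as an absolute difference: [0, int8 - fp32]."""
    return [0, int8 - fp32]

# e.g. INT8 running at 2.5x the FP32 throughput, with a small accuracy drop
print(relative_throughput(100.0, 250.0))  # [1, 2.5]
print(accuracy_delta(0.99, 0.98))
```

A ratio above 1 for throughput (or below 1 for latency) means the quantized model is faster; the accuracy entry shows how much precision was traded away.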
+## Example Output + +You should see log output and images showing the performance comparison with absolute and relative data and analysis between FP32 and INT8. + +The following is an example; your numbers will differ. + +``` +#absolute data +throughputs_times [1, 2.51508607887295] +latencys_times [1, 0.38379207710795576] +accuracys_times [0, -0.009999999999990905] + +#relative data +throughputs_times [1, 2.51508607887295] +latencys_times [1, 0.38379207710795576] +accuracys_times [0, -0.009999999999990905] +``` + +![Absolute Performance](img/inc_ab_perf_data.png) +![Relative Performance](img/inc_re_perf_data.png) #### Troubleshooting If you receive an error message, troubleshoot the problem using the **Diagnostics Utility for Intel® oneAPI Toolkits**. The diagnostic utility provides configuration and system checks to help find missing dependencies, permissions errors, and other issues. See the [Diagnostics Utility for Intel® oneAPI Toolkits User Guide](https://www.intel.com/content/www/us/en/develop/documentation/diagnostic-utility-user-guide/top.html) for more information on using the utility. +## Related Samples + +[PyTorch `Getting Started with Intel® Neural Compressor for Quantization` Sample](../INC-Quantization-Sample-for-PyTorch) ## License diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelAIKitContainer_GettingStarted/README.md b/AI-and-Analytics/Getting-Started-Samples/IntelAIKitContainer_GettingStarted/README.md deleted file mode 100755 index f51b7d7d95..0000000000 --- a/AI-and-Analytics/Getting-Started-Samples/IntelAIKitContainer_GettingStarted/README.md +++ /dev/null @@ -1,166 +0,0 @@ -# `Intel® AI Tools Container Getting Started` Sample ->**Note**: This sample is relevant only for AI Tools installed via Docker. If you have installed AI tools using PIP or Conda, this sample may not be relevant for you. - -The `Intel® AI Tools Container Getting Started` sample demonstrates how to use AI Tools containers. 
- -| Area | Description -|:--- |:--- -| What you will learn | How to start using the Intel® AI Tools container -| Time to complete | 10 minutes -| Category | Tutorial - -For more information on the AI Tools container, see [Intel Deep Learning](https://hub.docker.com/r/intel/deep-learning), [Intel Machine Learning](https://hub.docker.com/r/intel/classical-ml), [Intel Data Analytics](https://hub.docker.com/r/intel/data-analytics), and [Intel Inference Optimization](https://hub.docker.com/r/intel/inference-optimization) Docker Hub pages. - -## Purpose - -This sample provides a Bash script to help you configure an AI Tools container environment. You can build and train deep learning models using this Docker* environment. - -Containers allow you to set up and configure environments for building, running, and profiling AI applications and distribute them using images. You can also use Kubernetes* to automate the deployment and management of containers in the cloud. - -Read the [Get Started with the Intel® AI Tools for Linux*](https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-ai-linux/top.html) to find out how you can achieve performance gains for popular deep-learning and machine-learning frameworks through Intel optimizations. - -This sample shows an easy way to start using any of the [Intel® AI Tools](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-analytics-toolkit.html) components without the hassle of installing the toolkit, configuring networking and file sharing. - -## Prerequisites - -| Optimized for | Description -|:--- |:--- -| OS | Ubuntu* 20.04 (or newer) -| Hardware | Intel® Xeon® Scalable processor family -| Software | Intel® AI Tools Container - -## Key Implementation Details - -The Bash script provided in this sample performs the following -configuration steps: - -- Mounts the `/home` folder from host machine into the Docker container. 
You can share files between the host machine and the Docker container through the `/home` folder. - -- Applies proxy settings from the host machine into the Docker container. - -- Uses the same IP addresses between the host machine and the Docker container. - -- Forwards ports 8888, 6006, 6543, and 12345 from the host machine to the Docker container for some popular network services, such as Jupyter* Notebook and TensorFlow* TensorBoard. - -- Enable VTune Profiling - -## Run the `Intel® AI Tools Deep Learning Container Getting Started` Sample - -This sample uses a configuration script to automatically configure the environment. This provides fast and less error prone setup. For complete instructions for using the AI Tools container, see the [Getting Started Guide](https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-ai-linux/top/using-containers.html). - -### On Linux* - -You must have [Docker](https://docs.docker.com/engine/install/) -installed. - -1. Open a terminal. -2. Change to the sample folder, and pull the AI Tools Deep Learning Docker image by following [AI Tools Selector page](www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html). - ex: - ``` - docker pull intel/deep-learning:2024.0-py3.10 - ``` - >**Note**: If a permission denied error occurs, run the following command. - >``` - >sudo usermod -aG docker $USER - >``` - -4. Run the Docker images using the `run_oneapi_docker.sh` Bash script. - ``` - ./run_oneapi_docker.sh intel/deep-learning:2024.0-py3.10 - ``` - The script opens a Bash shell inside the Docker container and name the docker instance as "aitools_container" by default. - > **Note**: Install additional packages by adding them into requirements.txt file in the sample. Copy the modified requirements.txt into /tmp folder, so the bash script will install those packages for you. 
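The `docker exec` pattern used by the sample's helper script can also be driven from Python. A hedged sketch that only builds the argument list (the container name `aitools_container` comes from the script's default; nothing here actually requires Docker to run):

```python
def docker_exec_cmd(container, command, interactive=False):
    """Build a `docker exec` argument list like the helper script uses.

    The returned list can be passed to subprocess.run() when Docker is
    available; here we only construct it.
    """
    args = ["docker", "exec"]
    if interactive:
        args.append("-it")
    args += [container, "/bin/bash", "-c", command]
    return args

cmd = docker_exec_cmd("aitools_container", "pip install -r /tmp/requirements.txt")
print(" ".join(cmd))
```

Keeping command construction separate from execution makes the invocation easy to log or test before it ever touches the Docker daemon.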
- - To create a Bash session in the running container from outside the Docker container, enter a command similar to the following. - ``` - docker exec -it aitools_container /bin/bash - ``` -5. In the Bash shell inside the Docker container, activate the specialized environment. - ``` - source activate tensorflow - ``` - or - ``` - source activate torch - ``` -You can start using Intel® Optimization for TensorFlow* or Intel® Optimization for PyTorch* inside the Docker container. - ->**Note**: You can verify the activated environment. Change to the directory with the IntelAIKitContainer sample and run the `version_check.py` script. ->``` ->python version_check.py ->``` - -### Manage Docker* Images - -You can install additional packages, upload the workloads via the `/tmp` folder, and then commit your changes into a new Docker image, for example, `intel/deep-learning-v1`. -``` -docker commit -a "intel" -m "test" DOCKER_ID intel/deep-learning-v1 -``` ->**Note**: Replace `DOCKER_ID` with the ID of your container. Use `docker ps` to get the DOCKER_ID of your Docker container. - -You can use the new image name to start Docker. -``` -./run_oneapi_docker.sh intel/deep-learning-v1 -``` - -You can save the Docker image as a tar file. -``` -docker save -o oneapi-aikit-v1.tar intel/deep-learning-v1 -``` - -You can load the tar file on other machines. -``` -docker load -i deep-learning-v1.tar -``` - -### Docker Proxy - -For Docker proxy related problem, you could follow below instructions to configure proxy settings for your Docker client. - -1. Create a directory for the Docker service configurations. - ``` - sudo mkdir -p /etc/systemd/system/docker.service.d - ``` -2. Create a file called `proxy.conf` in our configuration directory. - ``` - sudo vi /etc/systemd/system/docker.service.d/proxy.conf - ``` -3. Add the contents similar to the following to the `.conf` file. Change the values to match your environment. 
- ``` - [Service] - Environment="HTTP_PROXY=http://proxy-hostname:911/" - Environment="HTTPS_PROXY="http://proxy-hostname:911/ - Environment="NO_PROXY="10.0.0.0/8,192.168.0.0/16,localhost,127.0.0.0/8,134.134.0.0/16" - ``` -4. Save your changes and exit the text editor. -5. Reload the daemon configuration. - ``` - sudo systemctl daemon-reload - ``` -6. Restart Docker to apply our changes. - ``` - sudo systemctl restart docker.service - ``` - -## Example Output - -### Output from TensorFlow* Environment - -``` -TensorFlow version: 2.6.0 -MKL enabled : True -``` - -### Output from PyTorch* Environment - -``` -PyTorch Version: 1.8.0a0+37c1f4a -mkldnn : True, mkl : True, openmp : True -``` - -## License - -Code samples are licensed under the MIT license. See -[License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt) for details. - -Third party program Licenses can be found here: [third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt). diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelAIKitContainer_GettingStarted/requirements.txt b/AI-and-Analytics/Getting-Started-Samples/IntelAIKitContainer_GettingStarted/requirements.txt deleted file mode 100644 index 6ccafc3f90..0000000000 --- a/AI-and-Analytics/Getting-Started-Samples/IntelAIKitContainer_GettingStarted/requirements.txt +++ /dev/null @@ -1 +0,0 @@ -matplotlib diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelAIKitContainer_GettingStarted/run_oneapi_docker.sh b/AI-and-Analytics/Getting-Started-Samples/IntelAIKitContainer_GettingStarted/run_oneapi_docker.sh deleted file mode 100755 index a0ae45bd77..0000000000 --- a/AI-and-Analytics/Getting-Started-Samples/IntelAIKitContainer_GettingStarted/run_oneapi_docker.sh +++ /dev/null @@ -1,36 +0,0 @@ -if [ -z "$1" ] ; then - echo "Usage: $0 [optional command]" - echo "Missing Docker image id. 
exiting" - exit -1 -fi - -image_id="$1" -name="aitools_container" -gpu_arg="" -GPU_DEV=/dev/dri -if [ -d "$GPU_DEV" ]; then - echo "$GPU_DEV exists." - gpu_arg=" --device=/dev/dri --ipc=host " -fi - -## remove any previously running containers -docker rm -f "$name" - -# mount the current directory at /work -this="${BASH_SOURCE-$0}" -mydir=$(cd -P -- "$(dirname -- "$this")" && pwd -P) - -export DOCKER_RUN_ENVS="-e ftp_proxy=${ftp_proxy} -e FTP_PROXY=${FTP_PROXY} -e http_proxy=${http_proxy} -e HTTP_PROXY=${HTTP_PROXY} -e https_proxy=${https_proxy} -e HTTPS_PROXY=${HTTPS_PROXY} -e no_proxy=${no_proxy} -e NO_PROXY=${NO_PROXY} -e socks_proxy=${socks_proxy} -e SOCKS_PROXY=${SOCKS_PROXY}" - -docker run --privileged $DOCKER_RUN_ENVS --rm --pid=host --cap-add=SYS_ADMIN --cap-add=SYS_PTRACE -dit --name "$name" $gpu_arg \ - -p 8888:8888 \ - -p 6006:6006 \ - -v"${PWD}:/home/dev/jupyter" \ - -v"/tmp:/tmp" \ - --net host \ - -p 6543:6543 \ - -p 12345:12345 \ - "$image_id" -docker exec -it "$name" /bin/bash -c "pip install -r /tmp/requirements.txt" -docker exec -it "$name" /bin/bash -c "apt-get update -yq;apt-get install -yq vim numactl" -docker exec -it "$name" /bin/bash diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelAIKitContainer_GettingStarted/sample.json b/AI-and-Analytics/Getting-Started-Samples/IntelAIKitContainer_GettingStarted/sample.json deleted file mode 100755 index baac1c2814..0000000000 --- a/AI-and-Analytics/Getting-Started-Samples/IntelAIKitContainer_GettingStarted/sample.json +++ /dev/null @@ -1,23 +0,0 @@ -{ - "guid": "0F95DA9E-0A5D-4CF2-B791-885B09675004", - "name": "Intel(R) AI Analytics Toolkit (AI Kit) Container Getting Started", - "categories": ["Toolkit/oneAPI AI And Analytics/Getting Started"], - "description": "This sample illustrates how to utilize the oneAPI AI Kit container.", - "builder": ["cli"], - "languages": [{"python":{}}], - "os":["linux"], - "targetDevice": ["CPU"], - "ciTests": { - "linux": [ - { - "id": "verion check", - "steps": 
[ - "source /intel/oneapi/intelpython/bin/activate", - "source activate tensorflow", - "python version_check.py" - ] - } - ] - }, - "expertise": "Tutorial" -} diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelAIKitContainer_GettingStarted/version_check.py b/AI-and-Analytics/Getting-Started-Samples/IntelAIKitContainer_GettingStarted/version_check.py deleted file mode 100644 index 2086074f4c..0000000000 --- a/AI-and-Analytics/Getting-Started-Samples/IntelAIKitContainer_GettingStarted/version_check.py +++ /dev/null @@ -1,82 +0,0 @@ -#import importlib -from importlib import util -tensorflow_found = util.find_spec("tensorflow") is not None -pytorch_found = util.find_spec("torch") is not None -pytorch_ext_found = util.find_spec("intel_pytorch_extension") is not None -tensorflow_ext_found = util.find_spec("intel_extension_for_tensorflow") is not None -xgboost_found = util.find_spec("xgboost") is not None -sklearn_found = util.find_spec("sklearn") is not None -sklearnex_found = util.find_spec("sklearnex") is not None -inc_found = util.find_spec("neural_compressor") is not None -modin_found = util.find_spec("modin") is not None - -if tensorflow_found == True: - - import tensorflow as tf - - import os - - def get_mkl_enabled_flag(): - - mkl_enabled = False - major_version = int(tf.__version__.split(".")[0]) - minor_version = int(tf.__version__.split(".")[1]) - if major_version >= 2: - onednn_enabled = 0 - if minor_version < 5: - from tensorflow.python import _pywrap_util_port - else: - from tensorflow.python.util import _pywrap_util_port - onednn_enabled = int(os.environ.get('TF_ENABLE_ONEDNN_OPTS', '0')) - mkl_enabled = _pywrap_util_port.IsMklEnabled() or (onednn_enabled == 1) - else: - mkl_enabled = tf.pywrap_tensorflow.IsMklEnabled() - return mkl_enabled - - print ("TensorTlow version: ", tf.__version__) - print("MKL enabled :", get_mkl_enabled_flag()) - if tensorflow_ext_found == True: - import intel_extension_for_tensorflow as itex - print("itex_version : ", 
itex.__version__) - -if pytorch_found == True: - import torch - print("PyTorch Version: ", torch.__version__) - mkldnn_enabled = torch.backends.mkldnn.is_available() - mkl_enabled = torch.backends.mkl.is_available() - openmp_enabled = torch.backends.openmp.is_available() - print('mkldnn : {0}, mkl : {1}, openmp : {2}'.format(mkldnn_enabled, mkl_enabled, openmp_enabled)) - print(torch.__config__.show()) - - if pytorch_ext_found == True: - import intel_pytorch_extension as ipex - print("ipex_verion : ",ipex.__version__) - -if xgboost_found == True: - import xgboost as xgb - print("XGBoost Version: ", xgb.__version__) - -if modin_found == True: - import modin - import modin.config as cfg - major_version = int(modin.__version__.split(".")[0]) - minor_version = int(modin.__version__.split(".")[1]) - print("Modin Version: ", modin.__version__) - cfg_engine = '' - if minor_version > 12 and major_version == 0: - cfg_engine = cfg.StorageFormat.get() - - else: - cfg_engine = cfg.Engine.get() - print("Modin Engine: ", cfg_engine) - -if sklearn_found == True: - import sklearn - print("scikit learn Version: ", sklearn.__version__) - if sklearnex_found == True: - import sklearnex - print("have scikit learn ext 2021.4 : ", sklearnex._utils.get_sklearnex_version((2021, 'P', 400))) - -if inc_found == True: - import neural_compressor as inc - print("neural_compressor version {}".format(inc.__version__)) diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelPython_XGBoost_GettingStarted/README.md b/AI-and-Analytics/Getting-Started-Samples/IntelPython_XGBoost_GettingStarted/README.md index c8401d4492..cc86dd9c84 100755 --- a/AI-and-Analytics/Getting-Started-Samples/IntelPython_XGBoost_GettingStarted/README.md +++ b/AI-and-Analytics/Getting-Started-Samples/IntelPython_XGBoost_GettingStarted/README.md @@ -1,12 +1,12 @@ -# `Intel® Python XGBoost* Getting Started` Sample +# Intel® Python XGBoost* Getting Started Sample The `Intel® Python XGBoost* Getting Started` sample demonstrates 
how to set up and train an XGBoost model on datasets for prediction. | Area | Description | :--- | :--- +| Category | Getting Started | What you will learn | The basics of XGBoost programming model for Intel CPUs | Time to complete | 5 minutes -| Category | Getting Started ## Purpose @@ -24,75 +24,93 @@ In this code sample, you will learn how to use Intel optimizations for XGBoost p ## Key Implementation Details -This Getting Started sample code is implemented for CPU using the Python language. The example assumes you have XGboost installed inside a conda environment, similar to what is delivered with the installation of the Intel® Distribution for Python* as part of the [Intel® AI Tools](https://software.intel.com/en-us/oneapi/ai-kit). +- This Getting Started sample code is implemented for CPU using the Python language. The example assumes you have XGboost installed inside a conda environment, similar to what is delivered with the installation of the Intel® Distribution for Python* as part of the [Intel® AI Tools](https://software.intel.com/en-us/oneapi/ai-kit). -XGBoost* is ready for use once you finish the Intel® AI Tools installation and have run the post installation script. +- XGBoost* is ready for use once you finish the Intel® AI Tools installation and have run the post installation script. -## Configure Environment (Only applicable to Intel AI Tools Offline Installer) -If you have already set up the PIP or Conda environment and installed AI Tools go directly to Run the Notebook. -> **Note**: If you have not already done so, set up your CLI -> environment by sourcing the `setvars` script in the root of your oneAPI installation. -> -> Linux*: -> - For system wide installations: `. /opt/intel/oneapi/setvars.sh` -> - For private installations: ` . 
~/intel/oneapi/setvars.sh` -> - For non-POSIX shells, like csh, use the following command: `bash -c 'source /setvars.sh ; exec csh'` -> -> For more information on configuring environment variables, see *[Use the setvars Script with Linux* or macOS*](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-development-environment-setup/use-the-setvars-script-with-linux-or-macos.html)*. +## Environment Setup +You will need to download and install the following toolkits, tools, and components to use the sample. -### Activate Conda with Root Access +**1. Get Intel® AI Tools** -If you activated another environment, you can return with the following command: -``` -source activate base -``` -### Activate Conda without Root Access (Optional) +Required AI Tools: Intel® Optimization for XGBoost* +
If you have not already, select and install these Tools via [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html). AI and Analytics samples are validated on AI Tools Offline Installer. It is recommended to select Offline Installer option in AI Tools Selector. -By default, the Intel® AI Tools are installed in the inteloneapi folder, which requires root privileges to manage it. If you would like to bypass using root access to manage your conda environment, then you can clone and active your desired conda environment using the following commands: +**2. Install dependencies** ``` -conda create --name user_base --clone base -source activate user_base +pip install -r requirements.txt ``` +**Install Jupyter Notebook** by running `pip install notebook`. Alternatively, see [Installing Jupyter](https://jupyter.org/install) for detailed installation instructions. -## Run the `Intel® Python XGBoost* Getting Started` Sample - -### Install Jupyter Notebook - -1. Change to the sample directory. -2. Install Jupyter Notebook with an appropriate kernel. - ``` - conda install jupyter nb_conda_kernels - ``` -### Run Jupyter Notebook +## Run the Sample +>**Note**: Before running the sample, make sure [Environment Setup](https://github.com/oneapi-src/oneAPI-samples/tree/master/AI-and-Analytics/Getting-Started-Samples/IntelPython_XGBoost_GettingStarted#environment-setup) is completed. +Go to the section which corresponds to the installation method chosen in [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html) to see relevant instructions: +* [AI Tools Offline Installer (Validated)](#ai-tools-offline-installer-validated) +* [Conda/PIP](#condapip) +* [Docker](#docker) ->**Note**: You cannot execute the sample in Jupyter Notebook, but you can still view inside the notebook to follow the included write-up and description. +### AI Tools Offline Installer (Validated) +1. 
If you have not already done so, activate the AI Tools bundle base environment. If you used the default location to install AI Tools, open a terminal and type the following
+```
+source $HOME/intel/oneapi/intelpython/bin/activate
+```
+If you used a separate location, open a terminal and type the following
+```
+source <custom_path>/bin/activate
+```
+2. Activate the Conda environment:
+```
+conda activate xgboost
+```
+3. Clone the GitHub repository:
+```
+git clone https://github.com/oneapi-src/oneAPI-samples.git
+cd oneAPI-samples/AI-and-Analytics/Getting-Started-Samples/IntelPython_XGBoost_GettingStarted
+```
-1. Change to the sample directory.
-2. Launch Jupyter Notebook.
-   ```
-   jupyter notebook
-   ```
-3. Locate and select the Notebook.
-   ```
-   IntelPython_XGBoost_GettingStarted.ipynb
-   ```
-4. Click the **Run** button to move through the cells in sequence.
+4. Launch Jupyter Notebook:
+> **Note**: You might need to register the Conda kernel as a Jupyter Notebook kernel;
see [the instructions](https://github.com/IntelAI/models/tree/master/docs/notebooks/perf_analysis#option-1-conda-environment-creation).
+```
+jupyter notebook --ip=0.0.0.0
+```
+
+5. Follow the instructions to open the URL with the token in your browser.
+6. Select the Notebook:
+```
+IntelPython_XGBoost_GettingStarted.ipynb
+```
-### Run the Python Script
+7. Change the kernel to `xgboost`.
+
+8. Run every cell in the Notebook in sequence.
-1. Still in Jupyter Notebook.
+### Conda/PIP
+> **Note**: Make sure your Conda/Python environment with AI Tools installed is activated.
+1. Clone the GitHub repository:
+```
+git clone https://github.com/oneapi-src/oneAPI-samples.git
+cd oneAPI-samples/AI-and-Analytics/Getting-Started-Samples/IntelPython_XGBoost_GettingStarted
+```
+2. 
Launch Jupyter Notebook:
+> **Note**: You might need to register the Conda kernel as a Jupyter Notebook kernel;
see [the instructions](https://github.com/IntelAI/models/tree/master/docs/notebooks/perf_analysis#option-1-conda-environment-creation).
+```
+jupyter notebook --ip=0.0.0.0
+```
+
+3. Follow the instructions to open the URL with the token in your browser.
+4. Select the Notebook:
+```
+IntelPython_XGBoost_GettingStarted.ipynb
+```
+5. Run every cell in the Notebook in sequence.
-2. Select **File** > **Download as** > **Python (py)**.
-3. Run the script.
-   ```
-   python IntelPython_XGBoost_GettingStarted.py
-   ```
-   The output files of the script will be saved in **models** and **result** directories.
+### Docker
+AI Tools Docker images already have Get Started samples pre-installed. Refer to [Working with Preset Containers](https://github.com/intel/ai-containers/tree/main/preset) to learn how to run the Docker containers and samples.
-#### Troubleshooting
-If you receive an error message, troubleshoot the problem using the **Diagnostics Utility for Intel® oneAPI Toolkits**. The diagnostic utility provides configuration and system checks to help find missing dependencies, permissions errors, and other issues. See the [Diagnostics Utility for Intel® oneAPI Toolkits User Guide](https://www.intel.com/content/www/us/en/develop/documentation/diagnostic-utility-user-guide/top.html) for more information on using the utility. 
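Before launching the notebook, you can confirm that the required packages are importable with a short `importlib` probe, similar to the `version_check.py` pattern used elsewhere in these samples (the package list below is illustrative):

```python
from importlib import util

def check_packages(names):
    # True for each package that can be imported in the current environment
    return {name: util.find_spec(name) is not None for name in names}

# "json" ships with Python, so it is always found; "xgboost" depends on your env
status = check_packages(["xgboost", "json"])
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If `xgboost` is reported missing, revisit the Environment Setup steps above before opening the notebook.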
## Example Output @@ -102,6 +120,11 @@ If you receive an error message, troubleshoot the problem using the **Diagnostic RMSE: 11.113036205909719 [CODE_SAMPLE_COMPLETED_SUCCESFULLY] ``` +## Related Samples + +* [Intel® Python XGBoost Daal4py Prediction](https://github.com/oneapi-src/oneAPI-samples/tree/master/AI-and-Analytics/Features-and-Functionality/IntelPython_XGBoost_daal4pyPrediction) +* [Intel® Python Scikit-learn Extension Getting Started](https://github.com/oneapi-src/oneAPI-samples/tree/master/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_SKLearn_GettingStarted) + ## License @@ -109,3 +132,5 @@ Code samples are licensed under the MIT license. See [License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt) for details. Third party program Licenses can be found here: [third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt). + +*Other names and brands may be claimed as the property of others. [Trademarks](https://www.intel.com/content/www/us/en/legal/trademarks.html) diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelPython_daal4py_GettingStarted/README.md b/AI-and-Analytics/Getting-Started-Samples/IntelPython_daal4py_GettingStarted/README.md index 6b1c1ec08b..2789265ec2 100755 --- a/AI-and-Analytics/Getting-Started-Samples/IntelPython_daal4py_GettingStarted/README.md +++ b/AI-and-Analytics/Getting-Started-Samples/IntelPython_daal4py_GettingStarted/README.md @@ -1,14 +1,13 @@ -# `Intel® Python Daal4py Getting Started` Sample +# Intel® Python Daal4py Getting Started Sample -The `Intel® Python Daal4py Getting Started` sample code shows how to do batch linear regression using the Python API package daal4py powered by the Intel® oneAPI Data Analytics Library (oneDAL). 
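For context, the batch linear regression this sample performs can be sketched with NumPy's least-squares solver (an illustration of the underlying math on synthetic data, not the daal4py API):

```python
import numpy as np

# Synthetic, noise-free data so the fitted coefficients are recovered exactly
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 3.0  # known weights, intercept 3.0

# Batch fit: prepend a bias column, then minimize ||Xb @ w - y||^2
Xb = np.hstack([np.ones((len(X), 1)), X])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(w)  # intercept first, then the three weights
```

daal4py performs the same batch fit with optimized oneDAL kernels; the notebook walks through the actual training and prediction calls.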
+The `Intel® Python Daal4py Getting Started` sample code shows how to do batch linear regression using the Python API package daal4py powered by the [Intel® oneAPI Data Analytics Library (oneDAL)](https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onedal.html). | Area | Description | :--- | :--- +| Category | Getting Started | What you will learn | Basic daal4py programming model for Intel CPUs | Time to complete | 5 minutes -| Category | Getting Started -The sample demonstrates how to use software products that are powered by the [Intel® oneAPI Data Analytics Library (oneDAL)](https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onedal.html) and some components found in the [Intel® AI Tools](https://software.intel.com/content/www/us/en/develop/tools/oneapi/ai-analytics-toolkit.html). ## Purpose @@ -23,6 +22,7 @@ In this sample, you will run a batch Linear Regression model with oneDAL daal4py | OS | Ubuntu* 20.04 (or newer) | Hardware | Intel Atom® processors
Intel® Core™ processor family
Intel® Xeon® processor family
Intel® Xeon® Scalable processor family | Software | Intel® oneAPI Data Analytics Library (oneDAL) +> **Note**: AI and Analytics samples are validated on AI Tools Offline Installer. For the full list of validated platforms refer to [Platform Validation](https://github.com/oneapi-src/oneAPI-samples/tree/master?tab=readme-ov-file#platform-validation). ### For Local Development Environments @@ -35,89 +35,95 @@ You will need to download and install the following toolkits, tools, and compone ## Key Implementation Details -This get started sample code is implemented for CPUs using the Python language. The example assumes you have daal4py and scikit-learn installed inside a conda environment, similar to what is delivered with the installation of the Intel® Distribution for Python* as part of the Intel® AI Analytics Toolkit. - -The Intel® oneAPI Data Analytics Library (oneDAL) is ready for use once you finish the Intel® AI Analytics Toolkit installation and have run the post installation script. - -## Configure Environment (Only applicable to AI Tools Offline Installer) -If you have already set up the PIP or Conda environment and installed AI Tools go directly to Run the Notebook. - -> **Note**: If you have not already done so, set up your CLI -> environment by sourcing the `setvars` script in the root of your oneAPI installation. -> -> Linux*: -> - For system wide installations: `. /opt/intel/oneapi/setvars.sh` -> - For private installations: ` . ~/intel/oneapi/setvars.sh` -> - For non-POSIX shells, like csh, use the following command: `bash -c 'source /setvars.sh ; exec csh'` -> -> For more information on configuring environment variables, see *[Use the setvars Script with Linux* or macOS*](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-development-environment-setup/use-the-setvars-script-with-linux-or-macos.html)*. - - +- This get started sample code is implemented for CPUs using the Python language. 
The example assumes you have daal4py and scikit-learn installed inside a conda environment, similar to what is delivered with the installation of the Intel® Distribution for Python* as part of the Intel® AI Analytics Toolkit. -### Steps for Intel AI Tools Offline Installer +- The Intel® oneAPI Data Analytics Library (oneDAL) is ready for use once you finish the Intel® AI Analytics Toolkit installation and have run the post installation script. -1. Activate the conda environment. +## Environment Setup - 1. If you have the root access to your oneAPI installation path, choose this option. - - Intel Python environment will be active by default. However, if you activated another environment, you can return with the following command. - ``` - source activate base - ``` - - 2. If you do not have the root access to your oneAPI installation path, choose this option. - - By default, the Intel® AI Tools are installed in the ``/opt/intel/oneapi`` folder, which requires root privileges to manage it. If you would like to bypass using root access to manage your conda environment, then you can clone your desired conda environment and activate it using the following commands. +You will need to download and install the following toolkits, tools, and components to use the sample. - ``` - conda create --name usr_intelpython --clone base - source activate usr_intelpython - ``` +**1. Get Intel® AI Tools** -2. Install Jupyter Notebook. (Skip this step for Intel® DevCloud.) - ``` - conda install jupyter nb_conda_kernels - ``` +Required AI Tools: Intel® Optimization for XGBoost* +
If you have not already, select and install these Tools via [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html). AI and Analytics samples are validated on AI Tools Offline Installer. It is recommended to select the Offline Installer option in the AI Tools Selector.

-## Run the `Intel® Python Daal4py Getting Started` Sample

+**2. Install dependencies**
+```
+pip install -r requirements.txt
+```
+**Install Jupyter Notebook** by running `pip install notebook`. Alternatively, see [Installing Jupyter](https://jupyter.org/install) for detailed installation instructions.

-You can run the sample code in a Jupyter Notebook or as a Python script locally.

+## Run the Sample
+>**Note**: Before running the sample, make sure [Environment Setup](https://github.com/oneapi-src/oneAPI-samples/tree/master/AI-and-Analytics/Getting-Started-Samples/IntelPython_daal4py_GettingStarted#environment-setup) is completed.
+Go to the section which corresponds to the installation method chosen in [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html) to see relevant instructions:
+* [AI Tools Offline Installer (Validated)](#ai-tools-offline-installer-validated)
+* [Conda/PIP](#condapip)
+* [Docker](#docker)

-### Run the Jupyter Notebook

+### AI Tools Offline Installer (Validated)
+1. If you have not already done so, activate the AI Tools bundle base environment. If you used the default location to install AI Tools, open a terminal and type the following
+```
+source $HOME/intel/oneapi/intelpython/bin/activate
+```
+If you used a separate location, open a terminal and type the following
+```
+source <custom_path>/bin/activate
+```
+2. Activate the Conda environment:
+
+```
+conda activate xgboost
+```
+3. Clone the GitHub repository:
+```
+git clone https://github.com/oneapi-src/oneAPI-samples.git
+cd oneAPI-samples/AI-and-Analytics/Getting-Started-Samples/IntelPython_daal4py_GettingStarted
+```
+4. 
Launch Jupyter Notebook:
+> **Note**: You might need to register the Conda kernel as a Jupyter Notebook kernel;
see [the instructions](https://github.com/IntelAI/models/tree/master/docs/notebooks/perf_analysis#option-1-conda-environment-creation).
+```
+jupyter notebook --ip=0.0.0.0
+```
+
+5. Follow the instructions to open the URL with the token in your browser.
+6. Select the Notebook:
+```
+IntelPython_daal4py_GettingStarted.ipynb
+```
-1. Activate the conda environment.
-   ```
-   source activate base
-   # or
-   source activate usr_intelpython
-   ```
+7. Change the kernel to `xgboost`.
+
+8. Run every cell in the Notebook in sequence.
-2. Start the Jupyter notebook server.
-   ```
-   jupyter notebook
-   ```
+### Conda/PIP
+> **Note**: Make sure your Conda/Python environment with AI Tools installed is activated.
+1. Clone the GitHub repository:
+```
+git clone https://github.com/oneapi-src/oneAPI-samples.git
+cd oneAPI-samples/AI-and-Analytics/Getting-Started-Samples/IntelPython_daal4py_GettingStarted
+```
-3. Locate and select the Notebook.
-   ```
-   IntelPython_daal4py_GettingStarted.ipynb
-   ```
-4. Click the **Run** button to execute all cells in the Notebook in sequence.
+2. Launch Jupyter Notebook:
+> **Note**: You might need to register the Conda kernel as a Jupyter Notebook kernel;
see [the instructions](https://github.com/IntelAI/models/tree/master/docs/notebooks/perf_analysis#option-1-conda-environment-creation).
+```
+jupyter notebook --ip=0.0.0.0
+```
+
+3. Follow the instructions to open the URL with the token in your browser.
+4. Select the Notebook:
+```
+IntelPython_daal4py_GettingStarted.ipynb
+```
-### Run the Python Script Locally
+5. Run every cell in the Notebook in sequence.
-1. Activate the conda environment.
-   ```
-   source activate base
-   # or
-   source activate usr_intelpython
-   ```
+### Docker
+AI Tools Docker images already have Get Started samples pre-installed. 
Refer to [Working with Preset Containers](https://github.com/intel/ai-containers/tree/main/preset) to learn how to run the Docker containers and samples.
-2. Run the Python script.
-   ```
-   python IntelPython_daal4py_GettingStarted.py
-   ```
-   The script saves the output files in the included ``models`` and ``results`` directories.

## Example Output
@@ -147,6 +153,10 @@ Here is one of our loaded model's features:
 1.58423529e-02 -4.57542900e-01]]
[CODE_SAMPLE_COMPLETED_SUCCESFULLY]
```
+## Related Samples
+
+* [Intel® Python XGBoost* Getting Started Sample](https://github.com/oneapi-src/oneAPI-samples/tree/master/AI-and-Analytics/Getting-Started-Samples/IntelPython_XGBoost_GettingStarted)
+* [Intel® Python Scikit-learn Extension Getting Started Sample](https://github.com/oneapi-src/oneAPI-samples/tree/master/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_SKLearn_GettingStarted#intel-python-scikit-learn-extension-getting-started-sample)

## License
@@ -154,3 +164,5 @@ Code samples are licensed under the MIT license. See
[License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt) for details.

Third-party program licenses can be found here: [third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt).
+
+*Other names and brands may be claimed as the property of others. 
[Trademarks](https://www.intel.com/content/www/us/en/legal/trademarks.html)
diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelTensorFlow_GettingStarted/README.md b/AI-and-Analytics/Getting-Started-Samples/IntelTensorFlow_GettingStarted/README.md
index c16e4c0bb9..6f391f640c 100755
--- a/AI-and-Analytics/Getting-Started-Samples/IntelTensorFlow_GettingStarted/README.md
+++ b/AI-and-Analytics/Getting-Started-Samples/IntelTensorFlow_GettingStarted/README.md
@@ -1,12 +1,11 @@
-# ` TensorFlow* Getting Started` Sample
+# `TensorFlow* Getting Started` Sample

-The `TensorFlow* Getting Started` sample demonstrates how to train a TensorFlow* model and run inference using Intel® oneAPI Deep Neural Networks (Intel® oneDNN).
-
-| Area | Description
-|:--- |:---
-| What you will learn | The basics of using TensorFlow* with oneDNN optimizations
-| Time to complete | 10 minutes
-| Category | Getting Started
+The `TensorFlow* Getting Started` sample demonstrates how to train a TensorFlow* model and run inference on Intel® hardware.
+| Property | Description
+|:--- |:---
+| Category | Getting Started Sample
+| What you will learn | How to start using TensorFlow* on Intel® hardware.
+| Time to complete | 10 minutes

## Purpose
@@ -18,16 +17,15 @@ This sample code shows how to get started with TensorFlow*. It implements an exa

| Optimized for | Description
|:--- |:---
-| OS | Ubuntu* 18.0.x (and newer)
Windows* 10
+| OS | Ubuntu* 22.04 (and newer)
Windows* 10 and newer
| Hardware | Intel® Xeon® Scalable processor family
| Software | TensorFlow

-TensorFlow* is ready for use once you finish the Intel AI Tools installation. You can refer to the oneAPI [product page](https://software.intel.com/en-us/oneapi) for tools installation and the *[Get Started with the Intel® AI Tools for Linux*](https://software.intel.com/en-us/get-started-with-intel-oneapi-linux-get-started-with-the-intel-ai-analytics-toolkit)* for post-installation steps and scripts.
+> **Note**: AI and Analytics samples are validated on AI Tools Offline Installer. For the full list of validated platforms refer to [Platform Validation](https://github.com/oneapi-src/oneAPI-samples/tree/master?tab=readme-ov-file#platform-validation).

## Key Implementation Details

-You must export the environment variable `ONEDNN_VERBOSE=1` to display the deep learning primitives trace during execution.
-
+The sample includes one Python file, TensorFlow_HelloWorld.py, which implements the training and inference of a simple neural network:
- The training data is generated by `np.random`.
- The neural network with one convolution layer and one ReLU layer is created by `tf.nn.conv2d` and `tf.nn.relu`.
- The TF session is initialized by `tf.global_variables_initializer`.
@@ -39,70 +37,65 @@ You must export the environment variable `ONEDNN_VERBOSE=1` to display the deep
y_batch = y_data[step*N:(step+1)*N, :, :, :]
s.run(train, feed_dict={x: x_batch, y: y_batch})
```
-
+To show the hardware information, you must export the environment variable `ONEDNN_VERBOSE=1` to display the deep learning primitives trace during execution.
>**Note**: For convenience, code line os.environ["ONEDNN_VERBOSE"] = "1" has been added in the body of the script as an alternative method to setting this variable.

Runtime settings for `ONEDNN_VERBOSE`, `KMP_AFFINITY`, and `Inter/Intra-op` Threads are set within the script. 
You can read more about these settings in this dedicated document: *[Maximize TensorFlow* Performance on CPU: Considerations and Recommendations for Inference Workloads](https://software.intel.com/en-us/articles/maximize-tensorflow-performance-on-cpu-considerations-and-recommendations-for-inference)*. -### Run the Sample on Intel GPUs - -The sample code is CPU based, but you can run it using Intel® Extension for TensorFlow* with Intel® Data Center GPU Flex Series. If you are using the Intel GPU, refer to *[Intel GPU Software Installation Guide](https://intel.github.io/intel-extension-for-tensorflow/latest/docs/install/install_for_gpu.html)*. The sample should be able to run on GPU without any code changes. +### Run the Sample on Intel® GPUs +The sample code is CPU based, but you can run it using Intel® Extension for TensorFlow* with Intel® Data Center GPU Flex Series. If you are using the Intel GPU, refer to *[Intel GPU Software Installation Guide](https://intel.github.io/intel-extension-for-tensorflow/latest/docs/install/install_for_gpu.html)*. The sample should be able to run on GPU **without any code changes**. For details, refer to the *[Quick Example on Intel CPU and GPU](https://intel.github.io/intel-extension-for-tensorflow/latest/examples/quick_example.html)* topic of the *Intel® Extension for TensorFlow** documentation. -### Steps for Intel AI Tools Offline Installer - -These instructions demonstrate how to build and run a sample on a machine where you have installed the Intel® AI Tools. If you have already set up the PIP or Conda environment and installed AI Tools go directly to Run the Script. - -> **Note**: If you have not already done so, set up your CLI -> environment by sourcing the `setvars` script in the root of your oneAPI installation. -> -> Linux*: -> - For system wide installations: `. /opt/intel/oneapi/setvars.sh` -> - For private installations: ` . 
~/intel/oneapi/setvars.sh` -> - For non-POSIX shells, like csh, use the following command: `bash -c 'source /setvars.sh ; exec csh'` -> -> Windows*: -> - `C:\Program Files (x86)\Intel\oneAPI\setvars.bat` -> - Windows PowerShell*, use the following command: `cmd.exe "/K" '"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" && powershell'` -> -> For more information on configuring environment variables, see *[Use the setvars Script with Linux* or macOS*](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-development-environment-setup/use-the-setvars-script-with-linux-or-macos.html)* or *[Use the setvars Script with Windows*](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-development-environment-setup/use-the-setvars-script-with-windows.html)*. - -### Activate Conda with Root Access - -By default, the AI Kit is installed in the `intel/oneapi` folder, which requires root privileges to manage it. - -1. Activate Conda. - ``` - conda activate tensorflow - ``` - -### Activate Conda without Root Access (Optional) - -If you would like to bypass using root access to manage your conda environment, then you can clone and active your desired conda environment. - -1. Enter the following commands. - ``` - conda create --name user_tensorflow --clone tensorflow - conda activate user_tensorflow - ``` - +## Environment Setup + +You will need to download and install the following toolkits, tools, and components to use the sample. + +**1. Get Intel® AI Tools** + +Required AI Tools: +
If you have not already, select and install these Tools via [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html). AI and Analytics samples are validated on AI Tools Offline Installer. It is recommended to select Offline Installer option in AI Tools Selector.
+Alternatively, you can install with pip into an existing Python environment: +``` +pip install tensorflow==2.14 +``` +Please see the [supported versions](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html). + +## Run the Sample + +>**Note**: Before running the sample, make sure Environment Setup is completed. Go to the section which corresponds to the installation method chosen in [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html) to see relevant instructions: +* [AI Tools Offline Installer (Validated)](#ai-tools-offline-installer-validated) +* [Conda/PIP](#condapip) +* [Docker](#docker) + +### AI Tools Offline Installer (Validated) +1. If you have not already done so, activate the AI Tools bundle base environment. If you used the default location to install AI Tools, open a terminal and type the following: +``` +source $HOME/intel/oneapi/intelpython/bin/activate +``` +If you used a separate location, open a terminal and type the following: +``` +source /bin/activate +``` +2. Activate the Conda environment: +``` +conda activate tensorflow +``` +3. Clone the GitHub repository: +``` +git clone https://github.com/oneapi-src/oneAPI-samples.git +cd oneAPI-samples/AI-and-Analytics/Getting-Started-Samples/IntelTensorFlow_GettingStarted +``` ### Run the Script -1. Change to the sample directory. -2. Run the Python script. - ``` - python TensorFlow_HelloWorld.py - ``` - -#### Troubleshooting - -If you receive an error message, troubleshoot the problem using the **Diagnostics Utility for Intel® oneAPI Toolkits**. The diagnostic utility provides configuration and system checks to help find missing dependencies, permissions errors, and other issues. See the *[Diagnostics Utility for Intel® oneAPI Toolkits User Guide](https://www.intel.com/content/www/us/en/develop/documentation/diagnostic-utility-user-guide/top.html)* for more information on using the utility. - - +Run the Python script. 
+``` +python TensorFlow_HelloWorld.py +``` ## Example Output -1. One the initial run, you should see results similar to the following: +1. With the initial run, you should see results similar to the following: ``` 0 0.4147554 @@ -122,24 +115,36 @@ If you receive an error message, troubleshoot the problem using the **Diagnostic 3. Run the sample again. You should see verbose results similar to the following: ``` - 2022-04-24 16:56:02.497963: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA - To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. - onednn_verbose,info,oneDNN v2.5.0 (commit N/A) - onednn_verbose,info,cpu,runtime:OpenMP - onednn_verbose,info,cpu,isa:Intel AVX-512 with Intel DL Boost - onednn_verbose,info,gpu,runtime:none - onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time - onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:cdba:f dst_f32:p:blocked:Acdb16a:f,,,10x4x3x3,0.00195312 - onednn_verbose,exec,cpu,convolution,brgconv:avx512_core,forward_training,src_f32::blocked:acdb:f wei_f32:p:blocked:Acdb16a:f bia_f32::blocked:a:f dst_f32::blocked:acdb:f,attr-post-ops:eltwise_relu ,alg:convolution_direct,mb,4.96411 - onednn_verbose,exec,cpu,convolution,jit:avx512_common,backward_weights,src_f32::blocked:acdb:f wei_f32:p:blocked:Acdb16a:f bia_undef::undef::f dst_f32::blocked:acdb:f,,alg:convolution_direct,mb,0.567871 +2024-03-12 16:01:59.784340: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type CPU is enabled. 
+onednn_verbose,info,oneDNN v3.2.0 (commit 8f2a00d86546e44501c61c38817138619febbb10) +onednn_verbose,info,cpu,runtime:OpenMP,nthr:24 +onednn_verbose,info,cpu,isa:Intel AVX2 with Intel DL Boost +onednn_verbose,info,gpu,runtime:none +onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time +onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:cdba::f0 dst_f32:p:blocked:Acdb16a::f0,,,10x4x3x3,0.00195312 +onednn_verbose,exec,cpu,convolution,brgconv:avx2,forward_training,src_f32::blocked:acdb::f0 wei_f32:ap:blocked:Acdb16a::f0 bia_f32::blocked:a::f0 dst_f32::blocked:acdb::f0,attr-scratchpad:user attr-post-ops:eltwise_relu ,alg:convolution_direct,mb4_ic4oc10_ih128oh128kh3sh1dh0ph1_iw128ow128kw3sw1dw0pw1,1.19702 +onednn_verbose,exec,cpu,eltwise,jit:avx2,backward_data,data_f32::blocked:abcd::f0 diff_f32::blocked:abcd::f0,attr-scratchpad:user ,alg:eltwise_relu alpha:0 beta:0,4x128x128x10,0.112061 +onednn_verbose,exec,cpu,convolution,jit:avx2,backward_weights,src_f32::blocked:acdb::f0 wei_f32:ap:blocked:ABcd8b8a::f0 bia_undef::undef::: dst_f32::blocked:acdb::f0,attr-scratchpad:user ,alg:convolution_direct,mb4_ic4oc10_ih128oh128kh3sh1dh0ph1_iw128ow128kw3sw1dw0pw1,0.358887 ... ``` - >**Note**: See the *[oneAPI Deep Neural Network Library Developer Guide and Reference](https://oneapi-src.github.io/oneDNN/dev_guide_verbose.html)* for more details on the verbose log. +4. Troubleshooting + +If you receive an error message, troubleshoot the problem using the **Diagnostics Utility for Intel® oneAPI Toolkits**. The diagnostic utility provides configuration and system checks to help find missing dependencies, permissions errors, and other issues. See the *[Diagnostics Utility for Intel® oneAPI Toolkits User Guide](https://www.intel.com/content/www/us/en/develop/documentation/diagnostic-utility-user-guide/top.html)* for more information on using the utility. 
+or ask for support at https://github.com/intel/intel-extension-for-tensorflow + +## Related Samples + +* [Intel® Extension for TensorFlow* Getting Started Sample](https://github.com/oneapi-src/oneAPI-samples/blob/development/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_TensorFlow_GettingStarted/README.md) + ## License Code samples are licensed under the MIT license. See -[License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt) for details. +[License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt) +for details. + +Third party program Licenses can be found here: +[third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt) -Third-party program Licenses can be found here: [third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt). +*Other names and brands may be claimed as the property of others. [Trademarks](https://www.intel.com/content/www/us/en/legal/trademarks.html) diff --git a/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_PyTorch_GettingStarted/README.md b/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_PyTorch_GettingStarted/README.md index 82ba79b0d7..bf3ce7d2c9 100755 --- a/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_PyTorch_GettingStarted/README.md +++ b/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_PyTorch_GettingStarted/README.md @@ -1,33 +1,13 @@ # `Intel® Extension for PyTorch (IPEX) Getting Started` Sample -Intel® Extension for PyTorch (IPEX) extends PyTorch* with optimizations for extra performance boost on Intel hardware. 
Most of the optimizations will be included in stock PyTorch* releases eventually, and the intention of the extension is to deliver up-to-date features and optimizations for PyTorch* on Intel hardware, examples include AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX). +Intel® Extension for PyTorch (IPEX) extends PyTorch* with optimizations for extra performance boost on Intel hardware. -This sample contains a Jupyter* NoteBook that guides you through the process of running a PyTorch* inference workload on both GPU and CPU by using Intel® AI Tools and also analyze the GPU and CPU usage via Intel® oneAPI Deep Neural Network Library (oneDNN) verbose logs. - -| Area | Description |:--- |:--- -| What you will learn | How to get started with Intel® Extension for PyTorch (IPEX) +| Property | Description |:--- |:--- +| Category | Get Started Sample +| What you will learn | How to start using Intel® Extension for PyTorch (IPEX) | Time to complete | 15 minutes -## Prerequisites - -| Optimized for | Description -|:--- |:--- -| OS | Ubuntu* 22.04 -| Hardware | Intel® Xeon® scalable processor family
Intel® Data Center GPUs -| Software | Intel® Extension for PyTorch (IPEX) - - -## Hardware requirement - -Verified Hardware Platforms for CPU samples: - - Intel® CPU (Xeon, Core) - -Verified Hardware Platforms for GPU samples: - - [Intel® Data Center GPU Flex Series](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/data-center-gpu/flex-series/overview.html) - - [Intel® Data Center GPU Max Series](https://www.intel.com/content/www/us/en/products/docs/processors/max-series/overview.html) - - [Intel® Arc™ Graphics](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/arc.html) (experimental) - ## Purpose This sample code demonstrates how to begin using the Intel® Extension for PyTorch (IPEX). @@ -44,6 +24,16 @@ The Jupyter notebook in this sample also guides users how to change PyTorch* cod >Find more examples in the [*Examples*](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/examples.html) topic of the [*Intel® Extension for PyTorch (IPEX) Documentation*](https://intel.github.io/intel-extension-for-pytorch). +## Prerequisites + +| Optimized for | Description +|:--- |:--- +| OS | Ubuntu* 22.04 +| Hardware | Intel® Xeon® scalable processor family
Intel® Data Center GPUs +| Software | Intel® Extension for PyTorch (IPEX) + +> **Note**: AI and Analytics samples are validated on AI Tools Offline Installer. For the full list of validated platforms refer to [Platform Validation](https://github.com/oneapi-src/oneAPI-samples/tree/master?tab=readme-ov-file#platform-validation). + ## Key Implementation Details The sample uses pretrained model provided by Intel and published as part of [Intel AI Reference Models](https://github.com/IntelAI/models). The example also illustrates how to utilize TensorFlow* and Intel® Math Kernel Library (Intel® MKL) runtime settings to maximize CPU performance on ResNet50 workload. @@ -55,82 +45,74 @@ The sample uses pretrained model provided by Intel and published as part of [Int > **Note**: The test dataset is inherited from `torch.utils.data.Dataset`, and the model is inherited from `torch.nn.Module`. -## Run the `Intel® Extension for PyTorch (IPEX) Getting Started` Sample +## Environment Setup +You will need to download and install the following toolkits, tools, and components to use the sample. -If you have already set up the PIP or Conda environment and installed AI Tools go directly to Run the Notebook. +**1. Get Intel® AI Tools** -### Steps for Intel AI Tools Offline Installer +Required AI Tools: +
If you have not already, select and install these Tools via [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html). AI and Analytics samples are validated on AI Tools Offline Installer. It is recommended to select Offline Installer option in AI Tools Selector. -> **Note**: If you have not already done so, set up your CLI -> environment by sourcing the `setvars` script in the root of your oneAPI installation. -> -> Linux*: -> - For system wide installations: `. /opt/intel/oneapi/setvars.sh` -> - For private installations: ` . ~/intel/oneapi/setvars.sh` -> - For non-POSIX shells, like csh, use the following command: `bash -c 'source /setvars.sh ; exec csh'` -> -> For more information on configuring environment variables, see *[Use the setvars Script with Linux* or macOS*](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-development-environment-setup/use-the-setvars-script-with-linux-or-macos.html)*. - -#### Activate Conda - -1. Activate the conda environment: - ``` - conda activate pytorch - ``` - -2. Activate conda environment without Root access (Optional). - - By default, the Intel AI Tools are installed in the `/opt/intel/oneapi` folder and require root privileges to manage them. - - - You can choose to activate Conda environment without root access. To bypass root access to manage your Conda environment, clone and activate your desired Conda environment using the following commands similar to the following. - ``` - conda create --name user_pytorch --clone pytorch - ``` - Then activate your conda environment with the following command: - ``` - conda activate user_pytorch - ``` -#### Run the Script - -1. Navigate to the directory with the sample. - ``` - cd ~/oneAPI-samples/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_PyTorch_GettingStarted - ``` -2. Run the Python script. 
- ``` - python Intel_Extension_For_PyTorch_Hello_World.py - ``` - You will see the DNNL verbose trace after exporting the `DNNL_VERBOSE`: - ``` - export DNNL_VERBOSE=1 - ``` - >**Note**: Read more information about the mkldnn log at [https://oneapi-src.github.io/oneDNN/dev_guide_verbose.html](https://oneapi-src.github.io/oneDNN/dev_guide_verbose.html). - -### Run the Jupyter Notebook - -1. Change to the sample directory. -2. Launch Jupyter Notebook. - ``` - jupyter notebook --ip=0.0.0.0 --port 8888 --allow-root - ``` -3. Follow the instructions to open the URL with the token in your browser. -4. Locate and select the Notebook. - ``` - ResNet50_Inference.ipynb - ``` -5. Change your Jupyter Notebook kernel to **PyTorch**. -6. Run every cell in the Notebook in sequence. - -### Troubleshooting - -If you receive an error message, troubleshoot the problem using the **Diagnostics Utility for Intel® oneAPI Toolkits**. The diagnostic utility provides configuration and system checks to help find missing dependencies, permissions errors, and other issues. See the *[Diagnostics Utility for Intel® oneAPI Toolkits User Guide](https://www.intel.com/content/www/us/en/develop/documentation/diagnostic-utility-user-guide/top.html)* for more information on using the utility. - - -### Example Output + + +## Run the Sample +>**Note**: Before running the sample, make sure [Environment Setup](#environment-setup) is completed. +Go to the section which corresponds to the installation method chosen in [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html) to see relevant instructions: +* [AI Tools Offline Installer (Validated)](#ai-tools-offline-installer-validated) +* [Docker](#docker) + +### AI Tools Offline Installer (Validated) +1. If you have not already done so, activate the AI Tools bundle base environment. 
If you used the default location to install AI Tools, open a terminal and type the following +``` +source $HOME/intel/oneapi/intelpython/bin/activate +``` +If you used a separate location, open a terminal and type the following +``` +source /bin/activate +``` +2. Clone the GitHub repository: +``` +git clone https://github.com/oneapi-src/oneAPI-samples.git +cd oneAPI-samples/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_PyTorch_GettingStarted/ +``` +3. Run the Python script. +``` +python Intel_Extension_For_PyTorch_Hello_World.py +``` +You will see the oneDNN (DNNL) verbose trace after exporting `DNNL_VERBOSE`: +``` +export DNNL_VERBOSE=1 +``` +>**Note**: Read more about the oneDNN verbose log at [https://oneapi-src.github.io/oneDNN/dev_guide_verbose.html](https://oneapi-src.github.io/oneDNN/dev_guide_verbose.html). + +4. Launch Jupyter Notebook: +> **Note**: You might need to register the Conda environment as a Jupyter Notebook kernel; see [the instructions](https://github.com/IntelAI/models/tree/master/docs/notebooks/perf_analysis#option-1-conda-environment-creation). +``` +jupyter notebook --ip=0.0.0.0 --port 8888 --allow-root +``` + +5. Follow the instructions to open the URL with the token in your browser. +6. Locate and select the Notebook: +``` +ResNet50_Inference.ipynb +``` +7. Change your Jupyter Notebook kernel to **PyTorch** or **PyTorch-GPU**. + +8. Run every cell in the Notebook in sequence. + + + +### Docker +AI Tools Docker images already have Get Started samples pre-installed. Refer to [Working with Preset Containers](https://github.com/intel/ai-containers/tree/main/preset) to learn how to run the Docker containers and samples. + + +## Example Output With successful execution, it will print out `[CODE_SAMPLE_COMPLETED_SUCCESSFULLY]` in the terminal. + + ## License Code samples are licensed under the MIT license. 
See diff --git a/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_SKLearn_GettingStarted/readme.md b/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_SKLearn_GettingStarted/readme.md index 4bda715038..38839d565b 100644 --- a/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_SKLearn_GettingStarted/readme.md +++ b/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_SKLearn_GettingStarted/readme.md @@ -1,14 +1,18 @@ -# `Intel® Python Scikit-learn Extension Getting Started` Sample +# Intel® Python Scikit-learn Extension Getting Started Sample The `Intel® Python Scikit-learn Extension Getting Started` sample demonstrates how to use a support vector machine classifier from Intel® Extension for Scikit-learn* for digit recognition problem. All other machine learning algorithms available with Scikit-learn can be used in the similar way. Intel® Extension for Scikit-learn* speeds up scikit-learn applications. The acceleration is achieved through the use of the Intel® oneAPI Data Analytics Library (oneDAL) [Intel oneAPI Data Analytics Library](https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onedal.html), which comes with [Intel® AI Analytics Toolkit (AI Kit)](https://software.intel.com/content/www/us/en/develop/tools/oneapi/ai-analytics-toolkit.html). | Area | Description |:--- | :--- +| Category | Getting Started | What you will learn | How to use a basic Intel® Extension for Scikit-learn* programming model for Intel CPUs | Time to complete | 5 minutes -| Category | Getting Started + +## Purpose + +In this sample, you will run a support vector classifier model from sklearn with oneDAL Daal4py library memory objects. You will also learn how to train a model and save the information to a file. Intel® Extension for Scikit-learn* depends on Intel® Daal4py. Daal4py is a simplified API to oneDAL that allows for fast usage of the framework suited for Data Scientists or Machine Learning users. 
It is built to provide an abstraction to oneDAL for direct usage or integration into one's own framework. ## Prerequisites | Optimized for | Description @@ -22,29 +26,23 @@ You can refer to the oneAPI [product page](https://software.intel.com/en-us/onea oneDAL is ready for use once you finish the AI Kit installation and have run the post installation script. -## Purpose - -In this sample, you will run a support vector classifier model from sklearn with oneDAL Daal4py library memory objects. You will also learn how to train a model and save the information to a file. Intel® Extension for Scikit-learn* depends on Intel® Daal4py. Daal4py is a simplified API to oneDAL that allows for fast usage of the framework suited for Data Scientists or Machine Learning users. Built to help provide an abstraction to oneDAL for direct usage or integration into one's own framework. ## Key Implementation Details This Getting Started sample code is implemented for CPU using the Python language. The example assumes you have Intel® Extension for Scikit-learn* installed inside a conda environment, similar to what is delivered with the installation of the Intel® Distribution for Python* as part of the [Intel® AI Analytics Toolkit](https://software.intel.com/en-us/oneapi/ai-kit). Intel® Extension for Scikit-learn* is available as a part of Intel® AI Analytics Toolkit (AI kit). -## Configure the Local Environment +## Environment Setup -> **Note**: If you have not already done so, set up your CLI -> environment by sourcing the `setvars` script in the root of your oneAPI installation. -> -> Linux*: -> - For system wide installations: `. /opt/intel/oneapi/setvars.sh` -> - For private installations: ` . 
~/intel/oneapi/setvars.sh` -> - For non-POSIX shells, like csh, use the following command: `bash -c 'source /setvars.sh ; exec csh'` -> -> For more information on configuring environment variables, see *[Use the setvars Script with Linux* or macOS*](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-development-environment-setup/use-the-setvars-script-with-linux-or-macos.html)*. - -### On Linux* +1. If you have not already done so, activate the AI Tools bundle base environment. If you used the default location to install AI Tools, open a terminal and type the following +``` +source $HOME/intel/oneapi/intelpython/bin/activate +``` +If you used a separate location, open a terminal and type the following +``` +source /bin/activate +``` -#### Activate Conda with Root Access +2. Activate Conda with Root Access Intel Python environment will be active by default. However, if you activated another environment, you can return with the following command. ``` @@ -52,13 +50,18 @@ source activate base pip install -r requirements.txt ``` -#### Activate Conda without Root Access (Optional) +2a. Activate Conda without Root Access (Optional) By default, the Intel® AI Analytics Toolkit is installed in the inteloneapi folder, which requires root privileges to manage it. If you would like to bypass using root access to manage your conda environment, then you can clone and activate your desired conda environment using the following commands. ``` conda create --name usr_intelpython --clone base source activate usr_intelpython ``` +3. Clone the GitHub repository +``` +git clone https://github.com/oneapi-src/oneAPI-samples.git +cd oneAPI-samples/AI-and-Analytics/Getting-Started-Samples +``` ### Install Jupyter Notebook @@ -88,54 +91,6 @@ source activate usr_intelpython If you receive an error message, troubleshoot the problem using the **Diagnostics Utility for Intel® oneAPI Toolkits**. 
The diagnostic utility provides configuration and system checks to help find missing dependencies, permissions errors, and other issues. See the *[Diagnostics Utility for Intel® oneAPI Toolkits User Guide](https://www.intel.com/content/www/us/en/develop/documentation/diagnostic-utility-user-guide/top.html)* for more information on using the utility. - -### Run the Sample on Intel® DevCloud (Optional) - -1. If you do not already have an account, request an Intel® DevCloud account at [*Create an Intel® DevCloud Account*](https://intelsoftwaresites.secure.force.com/DevCloud/oneapi). -2. On a Linux* system, open a terminal. -3. SSH into Intel® DevCloud. - ``` - ssh DevCloud - ``` - > **Note**: You can find information about configuring your Linux system and connecting to Intel DevCloud at Intel® DevCloud for oneAPI [Get Started](https://devcloud.intel.com/oneapi/get_started). - -#### Run the Notebook - -1. Locate and select the Notebook. - ``` - Intel_Extension_For_SKLearn_GettingStarted.ipynb - ```` -2. Run every cell in the Notebook in sequence. - -#### Run the Python Script - -1. Change to the sample directory. -2. Configure the sample for the appropriate node. -
- You can specify nodes using a single line script. - - ``` - qsub -I -l nodes=1:xeon:ppn=2 -d . - ``` - - - `-I` (upper case I) requests an interactive session. - - `-l nodes=1:xeon:ppn=2` (lower case L) assigns one full GPU node. - - `-d .` makes the current folder as the working directory for the task. - - |Available Nodes |Command Options - |:--- |:--- - |GPU |`qsub -l nodes=1:gpu:ppn=2 -d .` - |CPU |`qsub -l nodes=1:xeon:ppn=2 -d .` - - - >**Note**: For more information on how to specify compute nodes read *[Launch and manage jobs](https://devcloud.intel.com/oneapi/documentation/job-submission/)* in the Intel® DevCloud Documentation. -
- -3. Run the script. - ``` - python Intel_Extension_For_SKLearn_GettingStarted.py - ``` - ## Example Output You should see printed output for cells (with similar numbers) and an accuracy result. @@ -150,9 +105,16 @@ Model accuracy on test data: 0.9833333333333333 [CODE_SAMPLE_COMPLETED_SUCCESFULLY] ``` +## Related Samples + +* [Intel® Python XGBoost* Getting Started](https://github.com/oneapi-src/oneAPI-samples/tree/master/AI-and-Analytics/Getting-Started-Samples/IntelPython_XGBoost_GettingStarted) +* [Intel® Python XGBoost Daal4py Prediction](https://github.com/oneapi-src/oneAPI-samples/tree/master/AI-and-Analytics/Features-and-Functionality/IntelPython_XGBoost_daal4pyPrediction) + ## License Code samples are licensed under the MIT license. See [License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt) for details. Third party program Licenses can be found here: [third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt). + +*Other names and brands may be claimed as the property of others. 
[Trademarks](https://www.intel.com/content/www/us/en/legal/trademarks.html) diff --git a/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_TensorFlow_GettingStarted/README.md b/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_TensorFlow_GettingStarted/README.md index 433e4ba797..6e5b98d6ef 100755 --- a/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_TensorFlow_GettingStarted/README.md +++ b/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_TensorFlow_GettingStarted/README.md @@ -1,74 +1,136 @@ -# Intel Extension for TensorFlow Getting Started Sample -This code sample will guide users how to run a tensorflow inference workload on both GPU and CPU by using Intel® AI Tools and also analyze the GPU and CPU usage via oneDNN verbose logs +# `Intel® Extension for TensorFlow* (ITEX) Getting Started` Sample -## Purpose - - Guide users how to use different conda environments in Intel® AI Tools to run TensorFlow workloads on both CPU and GPU - - Guide users how to validate the GPU or CPU usages for TensorFlow workloads on Intel CPU or GPU - - -## Key implementation details -1. leverage the [resnet50 inference sample](https://github.com/intel/intel-extension-for-tensorflow/tree/main/examples/infer_resnet50) from intel-extension-for-tensorflow -2. use the resnet50v1.5 pretrained model from TensorFlow Hub -3. infernece with images in intel caffe github -4. guide users how to use different conda environment to run on Intel CPU and GPU -5. analyze oneDNN verbose logs to validate GPU or CPU usage - -## Pre-requirements (Local or Remote Host Installation) +This code sample will guide users how to run a TensorFlow* inference workload on both GPU and CPU by using Intel® AI Tools and also analyze the GPU and CPU usage via oneDNN verbose logs. -TensorFlow* is ready for use once you finish the Intel® AI Tools installation and have run the post installation script. 
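A core task of this sample is analyzing oneDNN verbose logs to validate GPU or CPU usage. As an editorial illustration (not part of the sample itself; the `count_engines` helper is hypothetical), a few lines of Python can tally `onednn_verbose` exec records by engine, assuming the comma-separated layout shown in the sample outputs above (`onednn_verbose,exec,<engine>,<primitive>,...`):

```python
# Hypothetical helper: count oneDNN verbose "exec" records per engine (cpu/gpu)
# to check where the framework actually dispatched its primitives.
# Assumes the classic onednn_verbose CSV layout: marker,exec,engine,primitive,...
def count_engines(log_lines):
    counts = {"cpu": 0, "gpu": 0}
    for line in log_lines:
        fields = line.strip().split(",")
        if len(fields) >= 3 and fields[0] == "onednn_verbose" and fields[1] == "exec":
            engine = fields[2]
            if engine in counts:
                counts[engine] += 1
    return counts

sample_log = [
    "onednn_verbose,info,oneDNN v3.2.0 (commit 8f2a00d86546e44501c61c38817138619febbb10)",
    "onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:cdba::f0,,,10x4x3x3,0.00195312",
    "onednn_verbose,exec,cpu,convolution,brgconv:avx2,forward_training,alg:convolution_direct,1.19702",
]
print(count_engines(sample_log))  # {'cpu': 2, 'gpu': 0}
```

If every exec record reports `cpu` while you expected GPU execution, the workload fell back to the CPU engine.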
+| Property | Description +|:--- |:--- +| Category | Get Started Sample +| What you will learn | How to start using Intel® Extension for TensorFlow* (ITEX) +| Time to complete | 15 minutes -TensorFlow* is ready for use once you finish the Intel AI Tools installation. You can refer to the oneAPI [product page](https://software.intel.com/en-us/oneapi) for tools installation and the *[Get Started with the Intel® AI Tools for Linux*](https://software.intel.com/en-us/get-started-with-intel-oneapi-linux-get-started-with-the-intel-ai-analytics-toolkit)* for post-installation steps and scripts. +## Purpose + - Guide users how to use different conda environments in Intel® AI Tools to run TensorFlow* workloads on both CPU and GPU. + - Guide users how to validate the GPU or CPU usages for TensorFlow* workloads on Intel CPU or GPU, using ResNet50v1.5 as an example. -## Environment Setup -This sample requires two additional pip packages: tensorflow_hub and ipykerenl. -Therefore users need to clone the tensorflow conda environment into users' home folder and install those additional packages accordingly. -Please follow bellow steps to setup GPU environment. -1. Source oneAPI environment variables: ``` $source $HOME/intel/oneapi/intelpython/bin/activate ``` -2. Create conda env: ```$conda create --name user-tensorflow-gpu --clone tensorflow-gpu ``` -3. Activate the created conda env: ```$source activate user-tensorflow-gpu ``` -4. Install the required packages: ```(user-tensorflow-gpu) $pip install -r requirements.txt ``` -5. Deactivate conda env: ```(user-tensorflow-gpu)$conda deactivate ``` -6. Register the kernel to Jupyter NB: ``` $~/.conda/envs/user-tensorflow-gpu/bin/python -m ipykernel install --user --name=user-tensorflow-gpu ``` -Once users finish GPU environment setup, please do the same steps but remove "-gpu" from above steps. 
-In the end, you will have two new conda environments which are user-tensorflow-gpu and user-tensorflow +## Prerequisites -## How to Build and Run +| Optimized for | Description +|:--- |:--- +| OS | Ubuntu* 22.04 +| Hardware | Intel® Xeon® scalable processor family
Intel® Data Center GPU Max Series
Intel® Data Center GPU Flex Series
Intel® Arc™ A-Series | +| Software | Intel® Extension for TensorFlow* (ITEX) -You can run the Jupyter notebook with the sample code on your local -server or download the sample code from the notebook as a Python file and run it locally. +> **Note**: AI and Analytics samples are validated on AI Tools Offline Installer. For the full list of validated platforms refer to [Platform Validation](https://github.com/oneapi-src/oneAPI-samples/tree/master?tab=readme-ov-file#platform-validation). -### Run the Sample in Jupyter Notebook +## Key Implementation Details +1. Leverage the [resnet50 inference sample](https://github.com/intel/intel-extension-for-tensorflow/tree/main/examples/infer_resnet50) from intel-extension-for-tensorflow. +2. Use the ResNet50v1.5 pretrained model from TensorFlow Hub. +3. Run inference with images from the Intel Caffe GitHub repository. +4. Guide users how to use different conda environments to run on Intel CPU and GPU. +5. Analyze oneDNN verbose logs to validate GPU or CPU usage. -To open the Jupyter notebook on your local server: -1. Start the Jupyter notebook server. ``` jupyter notebook --ip=0.0.0.0 ``` - -2. Open the ``ResNet50_Inference.ipynb`` file in the Notebook Dashboard. +## Environment Setup +You will need to download and install the following toolkits, tools, and components to use the sample. + +**1. Get Intel® AI Tools** + +Required AI Tools: `Intel® Extension for TensorFlow*` +
If you have not already, select and install these Tools via [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html). AI and Analytics samples are validated on AI Tools Offline Installer. It is recommended to select Offline Installer option in AI Tools Selector. + +**2. Install dependencies** +``` +pip install -r requirements.txt +``` +**Install Jupyter Notebook** by running `pip install notebook`. Alternatively, see [Installing Jupyter](https://jupyter.org/install) for detailed installation instructions. + +## Run the Sample +>**Note**: Before running the sample, make sure [Environment Setup](#environment-setup) is completed. +Go to the section which corresponds to the installation method chosen in [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html) to see relevant instructions: +* [AI Tools Offline Installer (Validated)](#ai-tools-offline-installer-validated) +* [Conda/PIP](#condapip) +* [Docker](#docker) + +### AI Tools Offline Installer (Validated) +1. If you have not already done so, activate the AI Tools bundle base environment. If you used the default location to install AI Tools, open a terminal and type the following +``` +source $HOME/intel/oneapi/intelpython/bin/activate +``` +If you used a separate location, open a terminal and type the following +``` +source /bin/activate +``` +2. Activate the Conda environment: +``` +conda activate tensorflow-gpu ## For the system with Intel GPU +conda activate tensorflow ## For the system with Intel CPU +``` +3. Clone the GitHub repository: +``` +git clone https://github.com/oneapi-src/oneAPI-samples.git +cd oneAPI-samples/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_TensorFlow_GettingStarted +``` +4. 
Launch Jupyter Notebook:
+> **Note**: You might need to register the Conda environment as a Jupyter Notebook kernel;
+see [these instructions](https://github.com/IntelAI/models/tree/master/docs/notebooks/perf_analysis#option-1-conda-environment-creation)
+```
+jupyter notebook --ip=0.0.0.0
+```
+5. Follow the instructions to open the URL with the token in your browser.
+6. Select the Notebook:
+```
+ResNet50_Inference.ipynb
+```
+7. Change the kernel to `tensorflow-gpu` for systems with an Intel GPU or to `tensorflow` for systems with an Intel CPU.
+8. Run every cell in the Notebook in sequence.
+
+### Conda/PIP
+> **Note**: Make sure your Conda/Python environment with AI Tools installed is activated.
+1. Clone the GitHub repository:
+```
+git clone https://github.com/oneapi-src/oneAPI-samples.git
+cd oneAPI-samples/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_TensorFlow_GettingStarted
+```
+2. Launch Jupyter Notebook:
+> **Note**: You might need to register the Conda environment as a Jupyter Notebook kernel;
+see [these instructions](https://github.com/IntelAI/models/tree/master/docs/notebooks/perf_analysis#option-1-conda-environment-creation)
+```
+jupyter notebook --ip=0.0.0.0
+```
+3. Follow the instructions to open the URL with the token in your browser.
+4. Select the Notebook:
+```
+ResNet50_Inference.ipynb
+```
+5. Change the kernel to `tensorflow-gpu` for systems with an Intel GPU or to `tensorflow` for systems with an Intel CPU.
+6. Run every cell in the Notebook in sequence.
+
+### Docker
+AI Tools Docker images already have Get Started samples pre-installed. Refer to [Working with Preset Containers](https://github.com/intel/ai-containers/tree/main/preset) to learn how to run the containers and samples.
-
-3. Select the related jupyter kernel. In this example, select 'Kernel' -> 'Change kernel' -> user-tensorflow-gpu for GPU run as the first step.
-
-4. Run the cells in the Jupyter notebook sequentially by clicking the **Run** button.
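Key implementation detail 5 above checks whether work actually ran on the GPU by analyzing oneDNN verbose logs (produced when `ONEDNN_VERBOSE=1` is set before running the sample). A minimal stdlib-only sketch of that analysis; the `onednn_verbose,exec,<engine>,<primitive>,...` field layout shown here is an assumption and can vary between oneDNN versions:

```python
from collections import Counter

def count_engine_usage(log_lines):
    """Count oneDNN primitive executions per engine (cpu/gpu).

    Assumes the classic verbose format:
    onednn_verbose,exec,<engine>,<primitive>,...  (may vary by oneDNN version)
    """
    counts = Counter()
    for line in log_lines:
        fields = line.strip().split(",")
        if len(fields) > 3 and fields[0] == "onednn_verbose" and fields[1] == "exec":
            counts[fields[2]] += 1
    return counts

# Hypothetical log lines for illustration, not actual sample output.
sample_log = [
    "onednn_verbose,exec,gpu,convolution,jit:ir,forward_inference,...,0.25",
    "onednn_verbose,exec,gpu,reorder,...,0.01",
    "onednn_verbose,exec,cpu,inner_product,...,0.05",
    "onednn_verbose,info,oneDNN v3.1",  # non-exec lines are ignored
]
usage = count_engine_usage(sample_log)
print(usage)  # Counter({'gpu': 2, 'cpu': 1})
```

A run dominated by `gpu` entries confirms the `tensorflow-gpu` kernel dispatched to the Intel GPU; mostly `cpu` entries indicate CPU execution.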
+### Troubleshooting
+If an error occurs, troubleshoot the problem using the Diagnostics Utility for Intel® oneAPI Toolkits.
+[Learn more](https://www.intel.com/content/www/us/en/docs/oneapi/user-guide-diagnostic-utility/2024-0/overview.html)
-6. select user-tensorflow jupyter kernel and run again from beginning for CPU run.
----
-**NOTE**
+## Example Output
+On successful execution, the sample prints `[CODE_SAMPLE_COMPLETED_SUCCESSFULLY]` in the terminal.
-In the jupyter page, be sure to select the correct kernel. In this example, select 'Kernel' -> 'Change kernel' -> user-tensorflow-gpu or user-tensorflow.
----
+## Related Samples
-### Troubleshooting
-If an error occurs, troubleshoot the problem using the Diagnostics Utility for Intel® oneAPI Toolkits.
-[Learn more](https://software.intel.com/content/www/us/en/develop/documentation/diagnostic-utility-user-guide/top.html)
+Find more examples in the [Intel® Extension for TensorFlow* (ITEX) examples documentation](https://intel.github.io/intel-extension-for-tensorflow/latest/examples/README.html).
 ## License
+
 Code samples are licensed under the MIT license. See
-[License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt) for details.
+[License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt)
+for details.
-Third party program Licenses can be found here: [third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt)
+Third party program Licenses can be found here:
+[third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt)
-After learning how to use the extensions for Intel oneAPI Toolkits, return to this readme for instructions on how to build and run a sample.
+*Other names and brands may be claimed as the property of others. 
[Trademarks](https://www.intel.com/content/www/us/en/legal/trademarks.html) diff --git a/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_TensorFlow_GettingStarted/ResNet50_Inference.ipynb b/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_TensorFlow_GettingStarted/ResNet50_Inference.ipynb index 899bb19c3c..be456ccf94 100755 --- a/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_TensorFlow_GettingStarted/ResNet50_Inference.ipynb +++ b/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_TensorFlow_GettingStarted/ResNet50_Inference.ipynb @@ -216,7 +216,7 @@ "metadata": {}, "outputs": [], "source": [ - "!wget https://raw.githubusercontent.com/oneapi-src/oneAPI-samples/master/Libraries/oneDNN/tutorials/profiling/profile_utils.py" + "!wget https://raw.githubusercontent.com/oneapi-src/oneAPI-samples/development/Libraries/oneDNN/tutorials/profiling/profile_utils.py" ] }, { diff --git a/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_TensorFlow_GettingStarted/sample.json b/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_TensorFlow_GettingStarted/sample.json index 803bdc9043..55522107c0 100755 --- a/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_TensorFlow_GettingStarted/sample.json +++ b/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_TensorFlow_GettingStarted/sample.json @@ -15,12 +15,14 @@ "env": [ "source /intel/oneapi/intelpython/bin/activate", "conda activate tensorflow", - "pip install -r requirements.txt", + "pip install -r requirements.txt --no-deps", + "pip install tensorflow==2.15.0.post1", "pip install jupyter ipykernel", "python -m ipykernel install --user --name=tensorflow", "conda deactivate", "conda activate tensorflow-gpu", - "pip install -r requirements.txt", + "pip install -r requirements.txt --no-deps", + "pip install tensorflow==2.15.0.post1", "pip install jupyter ipykernel", "python -m ipykernel install --user --name=tensorflow-gpu", "conda deactivate" 
diff --git a/AI-and-Analytics/Getting-Started-Samples/Intel_oneCCL_Bindings_For_PyTorch_GettingStarted/README.md b/AI-and-Analytics/Getting-Started-Samples/Intel_oneCCL_Bindings_For_PyTorch_GettingStarted/README.md index d63c061f47..2d7e42e351 100644 --- a/AI-and-Analytics/Getting-Started-Samples/Intel_oneCCL_Bindings_For_PyTorch_GettingStarted/README.md +++ b/AI-and-Analytics/Getting-Started-Samples/Intel_oneCCL_Bindings_For_PyTorch_GettingStarted/README.md @@ -7,6 +7,10 @@ The oneAPI Collective Communications Library Bindings for PyTorch* (oneCCL Bindi | What you will learn | How to get started with oneCCL Bindings for PyTorch* | Time to complete | 60 minutes +## Purpose + +From this sample code, you will learn how to perform distributed training with oneCCL in PyTorch*. The `oneCCL_Bindings_GettingStarted.ipynb` Jupyter Notebook targets both CPUs and GPUs using oneCCL Bindings for PyTorch*. + ## Prerequisites | Optimized for | Description @@ -15,20 +19,7 @@ The oneAPI Collective Communications Library Bindings for PyTorch* (oneCCL Bindi | Hardware | Intel® Xeon® scalable processor family
Intel® Data Center GPU
| Software | Intel® Extension for PyTorch* (IPEX)
-
-### For Local Development Environments
-
-You will need to download and install the following toolkits, tools, and components to use the sample.
-
-- **Intel® AI Tools**
-
-  You can get the AI Tools from [the product page](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-analytics-toolkit.html).
See [*Get Started with the Intel® AI Tools for Linux**](https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-ai-linux) for AI Tools installation information and post-installation steps and scripts. - - oneCCL Bindings for PyTorch* is ready for use once you finish the Intel® AI Tools installation via Offline Installer and have run the post-installation script. - - - You can refer to the *[Get Started with the Intel® AI Tools for Linux*](https://software.intel.com/en-us/get-started-with-intel-oneapi-linux-get-started-with-the-intel-ai-analytics-toolkit)* for post-installation steps and scripts. - +> **Note**: AI and Analytics samples are validated on AI Tools Offline Installer. For the full list of validated platforms refer to [Platform Validation](https://github.com/oneapi-src/oneAPI-samples/tree/master?tab=readme-ov-file#platform-validation). ## Key Implementation Details @@ -43,44 +34,41 @@ The Jupyter Notebook also demonstrates how to change PyTorch* distributed worklo >- [Intel® oneCCL Bindings for PyTorch*](https://github.com/intel/torch-ccl) >- [Distributed Training with oneCCL in PyTorch*](https://github.com/intel/optimized-models/tree/master/pytorch/distributed) -## Purpose - -From this sample code, you will learn how to perform distributed training with oneCCL in PyTorch*. The `oneCCL_Bindings_GettingStarted.ipynb` Jupyter Notebook targets both CPUs and GPUs using oneCCL Bindings for PyTorch*. - ## Run the `oneCCL Bindings for PyTorch* Getting Started` Sample -If you have already set up the PIP or Conda environment and installed AI Tools go directly to Run the Notebook. -### Steps for Intel AI Tools Offline Installer - - -> **Note**: If you have not already done so, set up your CLI -> environment by sourcing the `setvars` script in the root of your oneAPI installation. -> -> Linux*: -> - For system wide installations: `. /opt/intel/oneapi/setvars.sh` -> - For private installations: ` . 
~/intel/oneapi/setvars.sh` -> - For non-POSIX shells, like csh, use the following command: `bash -c 'source /setvars.sh ; exec csh'` -> -> For more information on configuring environment variables, see *[Use the setvars Script with Linux* or macOS*](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-development-environment-setup/use-the-setvars-script-with-linux-or-macos.html)*. - -1. Read and follow the *Run Scripts and CPU Affinity* instructions at [https://github.com/intel/optimized-models/tree/master/pytorch/distributed#run-scripts--cpu-affinity](https://github.com/intel/optimized-models/tree/master/pytorch/distributed#run-scripts--cpu-affinity). - -### Run the Jupyter Notebook - -1. Change to the sample directory. -2. Launch Jupyter Notebook. - ``` - jupyter notebook --ip=0.0.0.0 --port 8888 --allow-root - ``` -3. Follow the instructions to open the URL with the token in your browser. -4. Locate and select the Notebook. - ``` - oneCCL_Bindings_GettingStarted.ipynb - ``` -5. Change your Jupyter Notebook kernel to **PyTorch** or **PyTorch-GPU**. -6. Run every cell in the Notebook in sequence. - +Go to the section which corresponds to the installation method chosen in [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html) to see relevant instructions: +* [AI Tools Offline Installer (Validated)](#ai-tools-offline-installer-validated) +* [Docker](#docker) + +### AI Tools Offline Installer (Validated) +1. If you have not already done so, activate the AI Tools bundle base environment. If you used the default location to install AI Tools, open a terminal and type the following +``` +source $HOME/intel/oneapi/intelpython/bin/activate +``` +If you used a separate location, open a terminal and type the following +``` +source /bin/activate +``` +2. 
Clone the GitHub repository: +``` +git clone https://github.com/oneapi-src/oneAPI-samples.git +cd oneAPI-samples/AI-and-Analytics/Getting-Started-Samples/Intel_oneCCL_Bindings_For_PyTorch_GettingStarted/ +``` +3. Launch Jupyter Notebook. +``` +jupyter notebook --ip=0.0.0.0 --port 8888 --allow-root +``` +4. Follow the instructions to open the URL with the token in your browser. +5. Locate and select the Notebook. + ``` + oneCCL_Bindings_GettingStarted.ipynb + ``` +6. Change your Jupyter Notebook kernel to **PyTorch** or **PyTorch-GPU**. +7. Run every cell in the Notebook in sequence. + +### Docker +AI Tools Docker images already have Get Started samples pre-installed. Refer to [Working with Preset Containers](https://github.com/intel/ai-containers/tree/main/preset) to learn how to run the docker and samples. ## License diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelModin_GettingStarted/.gitkeep b/AI-and-Analytics/Getting-Started-Samples/Modin_GettingStarted/.gitkeep similarity index 100% rename from AI-and-Analytics/Getting-Started-Samples/IntelModin_GettingStarted/.gitkeep rename to AI-and-Analytics/Getting-Started-Samples/Modin_GettingStarted/.gitkeep diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelAIKitContainer_GettingStarted/License.txt b/AI-and-Analytics/Getting-Started-Samples/Modin_GettingStarted/License.txt old mode 100755 new mode 100644 similarity index 100% rename from AI-and-Analytics/Getting-Started-Samples/IntelAIKitContainer_GettingStarted/License.txt rename to AI-and-Analytics/Getting-Started-Samples/Modin_GettingStarted/License.txt diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelModin_GettingStarted/IntelModin_GettingStarted.ipynb b/AI-and-Analytics/Getting-Started-Samples/Modin_GettingStarted/Modin_GettingStarted.ipynb similarity index 100% rename from AI-and-Analytics/Getting-Started-Samples/IntelModin_GettingStarted/IntelModin_GettingStarted.ipynb rename to 
AI-and-Analytics/Getting-Started-Samples/Modin_GettingStarted/Modin_GettingStarted.ipynb diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelModin_GettingStarted/README.md b/AI-and-Analytics/Getting-Started-Samples/Modin_GettingStarted/README.md similarity index 63% rename from AI-and-Analytics/Getting-Started-Samples/IntelModin_GettingStarted/README.md rename to AI-and-Analytics/Getting-Started-Samples/Modin_GettingStarted/README.md index aed6fc30e4..50b7e716ca 100644 --- a/AI-and-Analytics/Getting-Started-Samples/IntelModin_GettingStarted/README.md +++ b/AI-and-Analytics/Getting-Started-Samples/Modin_GettingStarted/README.md @@ -1,18 +1,18 @@ -# `Intel® Modin* Get Started` Sample +# Modin* Get Started Sample -The `Intel® Modin Getting Started` sample demonstrates how to use distributed Pandas using the Intel® Distribution of Modin* package. It demonstrates how to use software products that can be found in the [Intel® AI Tools](https://software.intel.com/content/www/us/en/develop/tools/oneapi/ai-analytics-toolkit.html). +The `Modin* Getting Started` sample demonstrates how to use distributed Pandas using the Modin package. | Area | Description | :--- | :--- -| What you will learn | Basic Intel® Distribution of Modin* programming model for Intel processors -| Time to complete | 5 to 8 minutes | Category | Getting Started +| What you will learn | Basic Modin* programming model for Intel processors +| Time to complete | 5 to 8 minutes ## Purpose -Intel® Distribution of Modin* uses Ray or Dask to provide a method to speed up your Pandas notebooks, scripts, and libraries. Unlike other distributed DataFrame libraries, Intel® Distribution of Modin* provides integration and compatibility with existing Pandas code. +Modin uses Ray or Dask to provide a method to speed up your Pandas notebooks, scripts, and libraries. Unlike other distributed DataFrame libraries, Modin provides integration and compatibility with existing Pandas code. 
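Modin's speedup comes from splitting a DataFrame into row partitions and processing them in parallel on Ray or Dask workers. A stdlib-only sketch of that partition-and-recombine idea; the helper names here are illustrative and are not Modin's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n_parts):
    """Split rows into roughly equal chunks, as Modin splits a DataFrame."""
    size = (len(rows) + n_parts - 1) // n_parts
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def parallel_map(func, rows, n_parts=4):
    """Apply func to every row, one partition per worker, then recombine."""
    parts = partition(rows, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        mapped = pool.map(lambda part: [func(r) for r in part], parts)
    return [r for part in mapped for r in part]

result = parallel_map(lambda x: x * 2, list(range(10)), n_parts=4)
print(result)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18] — same as sequential
```

In real code the compatibility claim above means the only change needed is the import: replace `import pandas as pd` with `import modin.pandas as pd` and the existing Pandas code keeps working.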
-In this sample, you will run Intel® Distribution of Modin*-accelerated Pandas functions and note the performance gain when compared to "stock" (or standard) Pandas functions. +In this sample, you will run Modin-accelerated Pandas functions and note the performance gain when compared to "stock" (or standard) Pandas functions. ## Prerequisites @@ -20,26 +20,26 @@ In this sample, you will run Intel® Distribution of Modin*-accelerated Pandas f | :--- | :--- | OS | Ubuntu* 18.04 (or newer) | Hardware | Intel® Atom® processors
Intel® Core™ processor family
Intel® Xeon® processor family
Intel® Xeon® Scalable Performance processor family -| Software | Intel® Distribution of Modin* +| Software | Modin ## Key Implementation Details This get started sample code is implemented for CPU using the Python language. The example assumes you have Pandas and Modin installed inside a conda environment. -## Configure Environment +## Environment Setup -1. Install Intel® Distribution of Modin* in a new conda environment. +1. Install Modin in a new conda environment. >**Note:** replace python=3.x with your own Python version ``` - conda create -n aikit-modin python=3.x -y - conda activate aikit-modin - conda install modin-all -c intel -y + conda create -n modin python=3.x -y + conda activate modin + conda install modin-all -c conda-forge -y ``` 2. Install Matplotlib. ``` - conda install -c intel matplotlib -y + conda install -c conda-forge matplotlib -y ``` 3. Install Jupyter Notebook. @@ -52,9 +52,9 @@ This get started sample code is implemented for CPU using the Python language. T conda install ipykernel python -m ipykernel install --user --name usr_modin ``` -## Run the `Intel® Modin* Get Started` Sample +## Run the `Modin* Get Started` Sample -You can run the Jupyter notebook with the sample code on your local server or download the sample code from the notebook as a Python file and run it locally. Visit [Intel® Distribution of Modin Getting Started Guide](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-distribution-of-modin-getting-started-guide.html) for more information. +You can run the Jupyter notebook with the sample code on your local server or download the sample code from the notebook as a Python file and run it locally. ### Run the Sample in Visual Studio Code* (Optional) @@ -87,29 +87,33 @@ To learn more about the extensions, see 3. Locate and open the Notebook. ``` - IntelModin_GettingStarted.ipynb + Modin_GettingStarted.ipynb ``` 4. Click the **Run** button to move through the cells in sequence. 
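The `python -m ipykernel install --user --name usr_modin` step above makes the conda environment visible to Jupyter by writing a small `kernel.json` kernelspec. A sketch of the spec that registration produces, so you can verify or debug a kernel by hand; exact fields may differ slightly across ipykernel versions:

```python
import json
import sys

def make_kernelspec(display_name):
    """Build a kernel.json like the one `python -m ipykernel install` writes.

    Follows the Jupyter kernelspec format; details may vary by ipykernel version.
    """
    return {
        "argv": [sys.executable, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "display_name": display_name,
        "language": "python",
    }

spec = make_kernelspec("Python (usr_modin)")
print(json.dumps(spec, indent=2))
```

Listing registered kernels with `jupyter kernelspec list` shows where each `kernel.json` lives if the `usr_modin` kernel does not appear in the notebook's kernel menu.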
### Run the Python Script Locally -1. Convert ``IntelModin_GettingStarted.ipynb`` to a Python file. There are two options. +1. Convert ``Modin_GettingStarted.ipynb`` to a Python file. There are two options. 1. Open the notebook and download the script as Python file: **File > Download as > Python (py)**. 2. Convert the notebook file to a Python script using commands similar to the following. ``` - jupyter nbconvert --to python IntelModin_GettingStarted.ipynb + jupyter nbconvert --to python Modin_GettingStarted.ipynb ``` 2. Run the Python script. ``` - ipython IntelModin_GettingStarted.py + ipython Modin_GettingStarted.py ``` ### Expected Output -The expected cell output is shown in the `IntelModin_GettingStarted.ipynb` Notebook. +The expected cell output is shown in the `Modin_GettingStarted.ipynb` Notebook. + +## Related Samples + +* [Modin Vs. Pandas Performance](https://github.com/oneapi-src/oneAPI-samples/tree/master/AI-and-Analytics/Getting-Started-Samples/Modin_Vs_Pandas) ## License @@ -117,3 +121,5 @@ Code samples are licensed under the MIT license. See [License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt) for details. Third party program Licenses can be found here: [third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt). + +*Other names and brands may be claimed as the property of others. 
[Trademarks](https://www.intel.com/content/www/us/en/legal/trademarks.html) diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelModin_GettingStarted/requirements.txt b/AI-and-Analytics/Getting-Started-Samples/Modin_GettingStarted/requirements.txt similarity index 100% rename from AI-and-Analytics/Getting-Started-Samples/IntelModin_GettingStarted/requirements.txt rename to AI-and-Analytics/Getting-Started-Samples/Modin_GettingStarted/requirements.txt diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelModin_GettingStarted/sample.json b/AI-and-Analytics/Getting-Started-Samples/Modin_GettingStarted/sample.json old mode 100755 new mode 100644 similarity index 68% rename from AI-and-Analytics/Getting-Started-Samples/IntelModin_GettingStarted/sample.json rename to AI-and-Analytics/Getting-Started-Samples/Modin_GettingStarted/sample.json index f3b4b60c60..270b1249f6 --- a/AI-and-Analytics/Getting-Started-Samples/IntelModin_GettingStarted/sample.json +++ b/AI-and-Analytics/Getting-Started-Samples/Modin_GettingStarted/sample.json @@ -1,8 +1,8 @@ { "guid": "AE280EFE-9EB1-406D-B32D-5991F707E195", - "name": "Intel® Distribution of Modin* Getting Started", + "name": "Modin* Getting Started", "categories": ["Toolkit/oneAPI AI And Analytics/Getting Started"], - "description": "This sample illustrates how to use Modin accelerated Pandas functions and notes the performance gain when compared to standard Pandas functions", + "description": "This sample illustrates how to use Modin* accelerated Pandas functions and notes the performance gain when compared to standard Pandas functions", "builder": ["cli"], "languages": [{"python":{}}], "os":["linux"], @@ -19,8 +19,8 @@ "conda activate intel-aikit-modin", "pip install -r requirements.txt # Installing notebook's dependencies", "pip install runipy # Installing 'runipy' for extended abilities to execute the notebook", - "runipy IntelModin_GettingStarted.ipynb # Test 'Modin is faster than pandas' case", - "MODIN_CPUS=1 runipy 
IntelModin_GettingStarted.ipynb # Test 'Modin is slower than pandas' case" + "runipy Modin_GettingStarted.ipynb # Test 'Modin* is faster than pandas' case", + "MODIN_CPUS=1 runipy Modin_GettingStarted.ipynb # Test 'Modin is slower than pandas' case" ] } ] diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelModin_GettingStarted/third-party-programs.txt b/AI-and-Analytics/Getting-Started-Samples/Modin_GettingStarted/third-party-programs.txt similarity index 98% rename from AI-and-Analytics/Getting-Started-Samples/IntelModin_GettingStarted/third-party-programs.txt rename to AI-and-Analytics/Getting-Started-Samples/Modin_GettingStarted/third-party-programs.txt index 90daff458d..e9f8042d0a 100644 --- a/AI-and-Analytics/Getting-Started-Samples/IntelModin_GettingStarted/third-party-programs.txt +++ b/AI-and-Analytics/Getting-Started-Samples/Modin_GettingStarted/third-party-programs.txt @@ -1,253 +1,253 @@ -oneAPI Code Samples - Third Party Programs File - -This file contains the list of third party software ("third party programs") -contained in the Intel software and their required notices and/or license -terms. This third party software, even if included with the distribution of the -Intel software, may be governed by separate license terms, including without -limitation, third party license terms, other Intel software license terms, and -open source software license terms. These separate license terms govern your use -of the third party programs as set forth in the “third-party-programs.txt” or -other similarly named text file. - -Third party programs and their corresponding required notices and/or license -terms are listed below. - --------------------------------------------------------------------------------- - -1. Nothings STB Libraries - -stb/LICENSE - - This software is available under 2 licenses -- choose whichever you prefer. 
- ------------------------------------------------------------------------------ - ALTERNATIVE A - MIT License - Copyright (c) 2017 Sean Barrett - Permission is hereby granted, free of charge, to any person obtaining a copy of - this software and associated documentation files (the "Software"), to deal in - the Software without restriction, including without limitation the rights to - use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies - of the Software, and to permit persons to whom the Software is furnished to do - so, subject to the following conditions: - The above copyright notice and this permission notice shall be included in all - copies or substantial portions of the Software. - THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR - IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, - FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE - AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER - LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, - OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - SOFTWARE. - ------------------------------------------------------------------------------ - ALTERNATIVE B - Public Domain (www.unlicense.org) - This is free and unencumbered software released into the public domain. - Anyone is free to copy, modify, publish, use, compile, sell, or distribute this - software, either in source code form or as a compiled binary, for any purpose, - commercial or non-commercial, and by any means. - In jurisdictions that recognize copyright laws, the author or authors of this - software dedicate any and all copyright interest in the software to the public - domain. We make this dedication for the benefit of the public at large and to - the detriment of our heirs and successors. 
We intend this dedication to be an - overt act of relinquishment in perpetuity of all present and future rights to - this software under copyright law. - THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR - IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, - FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE - AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION - WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. - --------------------------------------------------------------------------------- - -2. FGPA example designs-gzip - - SDL2.0 - -zlib License - - - This software is provided 'as-is', without any express or implied - warranty. In no event will the authors be held liable for any damages - arising from the use of this software. - - Permission is granted to anyone to use this software for any purpose, - including commercial applications, and to alter it and redistribute it - freely, subject to the following restrictions: - - 1. The origin of this software must not be misrepresented; you must not - claim that you wrote the original software. If you use this software - in a product, an acknowledgment in the product documentation would be - appreciated but is not required. - 2. Altered source versions must be plainly marked as such, and must not be - misrepresented as being the original software. - 3. This notice may not be removed or altered from any source distribution. - - --------------------------------------------------------------------------------- - -3. 
Nbody - (c) 2019 Fabio Baruffa - - Plotly.js - Copyright (c) 2020 Plotly, Inc - -Permission is hereby granted, free of charge, to any person obtaining a copy -of this software and associated documentation files (the "Software"), to deal -in the Software without restriction, including without limitation the rights -to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -copies of the Software, and to permit persons to whom the Software is -furnished to do so, subject to the following conditions: - -The above copyright notice and this permission notice shall be included in all -copies or substantial portions of the Software. - -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -SOFTWARE. -© 2020 GitHub, Inc. - --------------------------------------------------------------------------------- - -4. GNU-EFI - Copyright (c) 1998-2000 Intel Corporation - -The files in the "lib" and "inc" subdirectories are using the EFI Application -Toolkit distributed by Intel at http://developer.intel.com/technology/efi - -This code is covered by the following agreement: - -Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: - -Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. - -Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 
- -THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, -INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND -FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL INTEL BE -LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR -CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF -SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS -INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN -CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) -ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -POSSIBILITY OF SUCH DAMAGE. THE EFI SPECIFICATION AND ALL OTHER INFORMATION -ON THIS WEB SITE ARE PROVIDED "AS IS" WITH NO WARRANTIES, AND ARE SUBJECT -TO CHANGE WITHOUT NOTICE. - --------------------------------------------------------------------------------- - -5. Edk2 - Copyright (c) 2019, Intel Corporation. All rights reserved. - - Edk2 Basetools - Copyright (c) 2019, Intel Corporation. All rights reserved. - -SPDX-License-Identifier: BSD-2-Clause-Patent - --------------------------------------------------------------------------------- - -6. Heat Transmission - -GNU LESSER GENERAL PUBLIC LICENSE -Version 3, 29 June 2007 - -Copyright © 2007 Free Software Foundation, Inc. - -Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. - -This version of the GNU Lesser General Public License incorporates the terms and conditions of version 3 of the GNU General Public License, supplemented by the additional permissions listed below. - -0. Additional Definitions. -As used herein, “this License” refers to version 3 of the GNU Lesser General Public License, and the “GNU GPL” refers to version 3 of the GNU General Public License. 
- -“The Library” refers to a covered work governed by this License, other than an Application or a Combined Work as defined below. - -An “Application” is any work that makes use of an interface provided by the Library, but which is not otherwise based on the Library. Defining a subclass of a class defined by the Library is deemed a mode of using an interface provided by the Library. - -A “Combined Work” is a work produced by combining or linking an Application with the Library. The particular version of the Library with which the Combined Work was made is also called the “Linked Version”. - -The “Minimal Corresponding Source” for a Combined Work means the Corresponding Source for the Combined Work, excluding any source code for portions of the Combined Work that, considered in isolation, are based on the Application, and not on the Linked Version. - -The “Corresponding Application Code” for a Combined Work means the object code and/or source code for the Application, including any data and utility programs needed for reproducing the Combined Work from the Application, but excluding the System Libraries of the Combined Work. - -1. Exception to Section 3 of the GNU GPL. -You may convey a covered work under sections 3 and 4 of this License without being bound by section 3 of the GNU GPL. - -2. Conveying Modified Versions. -If you modify a copy of the Library, and, in your modifications, a facility refers to a function or data to be supplied by an Application that uses the facility (other than as an argument passed when the facility is invoked), then you may convey a copy of the modified version: - -a) under this License, provided that you make a good faith effort to ensure that, in the event an Application does not supply the function or data, the facility still operates, and performs whatever part of its purpose remains meaningful, or -b) under the GNU GPL, with none of the additional permissions of this License applicable to that copy. -3. 
Object Code Incorporating Material from Library Header Files. -The object code form of an Application may incorporate material from a header file that is part of the Library. You may convey such object code under terms of your choice, provided that, if the incorporated material is not limited to numerical parameters, data structure layouts and accessors, or small macros, inline functions and templates (ten or fewer lines in length), you do both of the following: - -a) Give prominent notice with each copy of the object code that the Library is used in it and that the Library and its use are covered by this License. -b) Accompany the object code with a copy of the GNU GPL and this license document. -4. Combined Works. -You may convey a Combined Work under terms of your choice that, taken together, effectively do not restrict modification of the portions of the Library contained in the Combined Work and reverse engineering for debugging such modifications, if you also do each of the following: - -a) Give prominent notice with each copy of the Combined Work that the Library is used in it and that the Library and its use are covered by this License. -b) Accompany the Combined Work with a copy of the GNU GPL and this license document. -c) For a Combined Work that displays copyright notices during execution, include the copyright notice for the Library among these notices, as well as a reference directing the user to the copies of the GNU GPL and this license document. -d) Do one of the following: -0) Convey the Minimal Corresponding Source under the terms of this License, and the Corresponding Application Code in a form suitable for, and under terms that permit, the user to recombine or relink the Application with a modified version of the Linked Version to produce a modified Combined Work, in the manner specified by section 6 of the GNU GPL for conveying Corresponding Source. -1) Use a suitable shared library mechanism for linking with the Library. 
A suitable mechanism is one that (a) uses at run time a copy of the Library already present on the user's computer system, and (b) will operate properly with a modified version of the Library that is interface-compatible with the Linked Version. -e) Provide Installation Information, but only if you would otherwise be required to provide such information under section 6 of the GNU GPL, and only to the extent that such information is necessary to install and execute a modified version of the Combined Work produced by recombining or relinking the Application with a modified version of the Linked Version. (If you use option 4d0, the Installation Information must accompany the Minimal Corresponding Source and Corresponding Application Code. If you use option 4d1, you must provide the Installation Information in the manner specified by section 6 of the GNU GPL for conveying Corresponding Source.) -5. Combined Libraries. -You may place library facilities that are a work based on the Library side by side in a single library together with other library facilities that are not Applications and are not covered by this License, and convey such a combined library under terms of your choice, if you do both of the following: - -a) Accompany the combined library with a copy of the same work based on the Library, uncombined with any other library facilities, conveyed under the terms of this License. -b) Give prominent notice with the combined library that part of it is a work based on the Library, and explaining where to find the accompanying uncombined form of the same work. -6. Revised Versions of the GNU Lesser General Public License. -The Free Software Foundation may publish revised and/or new versions of the GNU Lesser General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. - -Each version is given a distinguishing version number. 
If the Library as you received it specifies that a certain numbered version of the GNU Lesser General Public License “or any later version” applies to it, you have the option of following the terms and conditions either of that published version or of any later version published by the Free Software Foundation. If the Library as you received it does not specify a version number of the GNU Lesser General Public License, you may choose any version of the GNU Lesser General Public License ever published by the Free Software Foundation. - -If the Library as you received it specifies that a proxy can decide whether future versions of the GNU Lesser General Public License shall apply, that proxy's public statement of acceptance of any version is permanent authorization for you to choose that version for the Library. - --------------------------------------------------------------------------------- -7. Rodinia - Copyright (c)2008-2011 University of Virginia -All rights reserved. - -Redistribution and use in source and binary forms, with or without modification, are permitted without royalty fees or other restrictions, provided that the following conditions are met: - - * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. - * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. - * Neither the name of the University of Virginia, the Dept. of Computer Science, nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. - -THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE UNIVERSITY OF VIRGINIA OR THE SOFTWARE AUTHORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - -If you use this software or a modified version of it, please cite the most relevant among the following papers: - - - M. A. Goodrum, M. J. Trotter, A. Aksel, S. T. Acton, and K. Skadron. Parallelization of Particle Filter Algorithms. In Proceedings of the 3rd Workshop on Emerging Applications and Many-core Architecture (EAMA), in conjunction with the IEEE/ACM International -Symposium on Computer Architecture (ISCA), June 2010. - - - S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, Sang-Ha Lee and K. Skadron. -Rodinia: A Benchmark Suite for Heterogeneous Computing. IEEE International Symposium -on Workload Characterization, Oct 2009. - -- J. Meng and K. Skadron. "Performance Modeling and Automatic Ghost Zone Optimization -for Iterative Stencil Loops on GPUs." In Proceedings of the 23rd Annual ACM International -Conference on Supercomputing (ICS), June 2009. - -- L.G. Szafaryn, K. Skadron and J. Saucerman. "Experiences Accelerating MATLAB Systems -Biology Applications." in Workshop on Biomedicine in Computing (BiC) at the International -Symposium on Computer Architecture (ISCA), June 2009. - -- M. Boyer, D. Tarjan, S. T. Acton, and K. Skadron. "Accelerating Leukocyte Tracking using CUDA: -A Case Study in Leveraging Manycore Coprocessors." In Proceedings of the International Parallel -and Distributed Processing Symposium (IPDPS), May 2009. - -- S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron. 
"A Performance -Study of General Purpose Applications on Graphics Processors using CUDA" Journal of -Parallel and Distributed Computing, Elsevier, June 2008. - --------------------------------------------------------------------------------- -Other names and brands may be claimed as the property of others. - +oneAPI Code Samples - Third Party Programs File + +This file contains the list of third party software ("third party programs") +contained in the Intel software and their required notices and/or license +terms. This third party software, even if included with the distribution of the +Intel software, may be governed by separate license terms, including without +limitation, third party license terms, other Intel software license terms, and +open source software license terms. These separate license terms govern your use +of the third party programs as set forth in the “third-party-programs.txt” or +other similarly named text file. + +Third party programs and their corresponding required notices and/or license +terms are listed below. + +-------------------------------------------------------------------------------- + +1. Nothings STB Libraries + +stb/LICENSE + + This software is available under 2 licenses -- choose whichever you prefer. + ------------------------------------------------------------------------------ + ALTERNATIVE A - MIT License + Copyright (c) 2017 Sean Barrett + Permission is hereby granted, free of charge, to any person obtaining a copy of + this software and associated documentation files (the "Software"), to deal in + the Software without restriction, including without limitation the rights to + use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies + of the Software, and to permit persons to whom the Software is furnished to do + so, subject to the following conditions: + The above copyright notice and this permission notice shall be included in all + copies or substantial portions of the Software. 
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + ------------------------------------------------------------------------------ + ALTERNATIVE B - Public Domain (www.unlicense.org) + This is free and unencumbered software released into the public domain. + Anyone is free to copy, modify, publish, use, compile, sell, or distribute this + software, either in source code form or as a compiled binary, for any purpose, + commercial or non-commercial, and by any means. + In jurisdictions that recognize copyright laws, the author or authors of this + software dedicate any and all copyright interest in the software to the public + domain. We make this dedication for the benefit of the public at large and to + the detriment of our heirs and successors. We intend this dedication to be an + overt act of relinquishment in perpetuity of all present and future rights to + this software under copyright law. + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION + WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. + +-------------------------------------------------------------------------------- + +2. 
FPGA example designs-gzip + + SDL2.0 + +zlib License + + + This software is provided 'as-is', without any express or implied + warranty. In no event will the authors be held liable for any damages + arising from the use of this software. + + Permission is granted to anyone to use this software for any purpose, + including commercial applications, and to alter it and redistribute it + freely, subject to the following restrictions: + + 1. The origin of this software must not be misrepresented; you must not + claim that you wrote the original software. If you use this software + in a product, an acknowledgment in the product documentation would be + appreciated but is not required. + 2. Altered source versions must be plainly marked as such, and must not be + misrepresented as being the original software. + 3. This notice may not be removed or altered from any source distribution. + + +-------------------------------------------------------------------------------- + +3. Nbody + (c) 2019 Fabio Baruffa + + Plotly.js + Copyright (c) 2020 Plotly, Inc + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. +© 2020 GitHub, Inc. + +-------------------------------------------------------------------------------- + +4. GNU-EFI + Copyright (c) 1998-2000 Intel Corporation + +The files in the "lib" and "inc" subdirectories are using the EFI Application +Toolkit distributed by Intel at http://developer.intel.com/technology/efi + +This code is covered by the following agreement: + +Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: + +Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. + +Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. + +THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, +INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND +FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL INTEL BE +LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR +CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF +SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS +INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN +CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) +ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. THE EFI SPECIFICATION AND ALL OTHER INFORMATION +ON THIS WEB SITE ARE PROVIDED "AS IS" WITH NO WARRANTIES, AND ARE SUBJECT +TO CHANGE WITHOUT NOTICE. 
+ +-------------------------------------------------------------------------------- + +5. Edk2 + Copyright (c) 2019, Intel Corporation. All rights reserved. + + Edk2 Basetools + Copyright (c) 2019, Intel Corporation. All rights reserved. + +SPDX-License-Identifier: BSD-2-Clause-Patent + +-------------------------------------------------------------------------------- + +6. Heat Transmission + +GNU LESSER GENERAL PUBLIC LICENSE +Version 3, 29 June 2007 + +Copyright © 2007 Free Software Foundation, Inc. + +Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. + +This version of the GNU Lesser General Public License incorporates the terms and conditions of version 3 of the GNU General Public License, supplemented by the additional permissions listed below. + +0. Additional Definitions. +As used herein, “this License” refers to version 3 of the GNU Lesser General Public License, and the “GNU GPL” refers to version 3 of the GNU General Public License. + +“The Library” refers to a covered work governed by this License, other than an Application or a Combined Work as defined below. + +An “Application” is any work that makes use of an interface provided by the Library, but which is not otherwise based on the Library. Defining a subclass of a class defined by the Library is deemed a mode of using an interface provided by the Library. + +A “Combined Work” is a work produced by combining or linking an Application with the Library. The particular version of the Library with which the Combined Work was made is also called the “Linked Version”. + +The “Minimal Corresponding Source” for a Combined Work means the Corresponding Source for the Combined Work, excluding any source code for portions of the Combined Work that, considered in isolation, are based on the Application, and not on the Linked Version. 
+ +The “Corresponding Application Code” for a Combined Work means the object code and/or source code for the Application, including any data and utility programs needed for reproducing the Combined Work from the Application, but excluding the System Libraries of the Combined Work. + +1. Exception to Section 3 of the GNU GPL. +You may convey a covered work under sections 3 and 4 of this License without being bound by section 3 of the GNU GPL. + +2. Conveying Modified Versions. +If you modify a copy of the Library, and, in your modifications, a facility refers to a function or data to be supplied by an Application that uses the facility (other than as an argument passed when the facility is invoked), then you may convey a copy of the modified version: + +a) under this License, provided that you make a good faith effort to ensure that, in the event an Application does not supply the function or data, the facility still operates, and performs whatever part of its purpose remains meaningful, or +b) under the GNU GPL, with none of the additional permissions of this License applicable to that copy. +3. Object Code Incorporating Material from Library Header Files. +The object code form of an Application may incorporate material from a header file that is part of the Library. You may convey such object code under terms of your choice, provided that, if the incorporated material is not limited to numerical parameters, data structure layouts and accessors, or small macros, inline functions and templates (ten or fewer lines in length), you do both of the following: + +a) Give prominent notice with each copy of the object code that the Library is used in it and that the Library and its use are covered by this License. +b) Accompany the object code with a copy of the GNU GPL and this license document. +4. Combined Works. 
+You may convey a Combined Work under terms of your choice that, taken together, effectively do not restrict modification of the portions of the Library contained in the Combined Work and reverse engineering for debugging such modifications, if you also do each of the following: + +a) Give prominent notice with each copy of the Combined Work that the Library is used in it and that the Library and its use are covered by this License. +b) Accompany the Combined Work with a copy of the GNU GPL and this license document. +c) For a Combined Work that displays copyright notices during execution, include the copyright notice for the Library among these notices, as well as a reference directing the user to the copies of the GNU GPL and this license document. +d) Do one of the following: +0) Convey the Minimal Corresponding Source under the terms of this License, and the Corresponding Application Code in a form suitable for, and under terms that permit, the user to recombine or relink the Application with a modified version of the Linked Version to produce a modified Combined Work, in the manner specified by section 6 of the GNU GPL for conveying Corresponding Source. +1) Use a suitable shared library mechanism for linking with the Library. A suitable mechanism is one that (a) uses at run time a copy of the Library already present on the user's computer system, and (b) will operate properly with a modified version of the Library that is interface-compatible with the Linked Version. +e) Provide Installation Information, but only if you would otherwise be required to provide such information under section 6 of the GNU GPL, and only to the extent that such information is necessary to install and execute a modified version of the Combined Work produced by recombining or relinking the Application with a modified version of the Linked Version. (If you use option 4d0, the Installation Information must accompany the Minimal Corresponding Source and Corresponding Application Code. 
If you use option 4d1, you must provide the Installation Information in the manner specified by section 6 of the GNU GPL for conveying Corresponding Source.) +5. Combined Libraries. +You may place library facilities that are a work based on the Library side by side in a single library together with other library facilities that are not Applications and are not covered by this License, and convey such a combined library under terms of your choice, if you do both of the following: + +a) Accompany the combined library with a copy of the same work based on the Library, uncombined with any other library facilities, conveyed under the terms of this License. +b) Give prominent notice with the combined library that part of it is a work based on the Library, and explaining where to find the accompanying uncombined form of the same work. +6. Revised Versions of the GNU Lesser General Public License. +The Free Software Foundation may publish revised and/or new versions of the GNU Lesser General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. + +Each version is given a distinguishing version number. If the Library as you received it specifies that a certain numbered version of the GNU Lesser General Public License “or any later version” applies to it, you have the option of following the terms and conditions either of that published version or of any later version published by the Free Software Foundation. If the Library as you received it does not specify a version number of the GNU Lesser General Public License, you may choose any version of the GNU Lesser General Public License ever published by the Free Software Foundation. 
+ +If the Library as you received it specifies that a proxy can decide whether future versions of the GNU Lesser General Public License shall apply, that proxy's public statement of acceptance of any version is permanent authorization for you to choose that version for the Library. + +-------------------------------------------------------------------------------- +7. Rodinia + Copyright (c)2008-2011 University of Virginia +All rights reserved. + +Redistribution and use in source and binary forms, with or without modification, are permitted without royalty fees or other restrictions, provided that the following conditions are met: + + * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. + * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. + * Neither the name of the University of Virginia, the Dept. of Computer Science, nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY OF VIRGINIA OR THE SOFTWARE AUTHORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +If you use this software or a modified version of it, please cite the most relevant among the following papers: + + - M. A. Goodrum, M. J. Trotter, A. Aksel, S. T. Acton, and K. Skadron. Parallelization of Particle Filter Algorithms. In Proceedings of the 3rd Workshop on Emerging Applications and Many-core Architecture (EAMA), in conjunction with the IEEE/ACM International +Symposium on Computer Architecture (ISCA), June 2010. + + - S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, Sang-Ha Lee and K. Skadron. +Rodinia: A Benchmark Suite for Heterogeneous Computing. IEEE International Symposium +on Workload Characterization, Oct 2009. + +- J. Meng and K. Skadron. "Performance Modeling and Automatic Ghost Zone Optimization +for Iterative Stencil Loops on GPUs." In Proceedings of the 23rd Annual ACM International +Conference on Supercomputing (ICS), June 2009. + +- L.G. Szafaryn, K. Skadron and J. Saucerman. "Experiences Accelerating MATLAB Systems +Biology Applications." in Workshop on Biomedicine in Computing (BiC) at the International +Symposium on Computer Architecture (ISCA), June 2009. + +- M. Boyer, D. Tarjan, S. T. Acton, and K. Skadron. "Accelerating Leukocyte Tracking using CUDA: +A Case Study in Leveraging Manycore Coprocessors." In Proceedings of the International Parallel +and Distributed Processing Symposium (IPDPS), May 2009. + +- S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron. "A Performance +Study of General Purpose Applications on Graphics Processors using CUDA" Journal of +Parallel and Distributed Computing, Elsevier, June 2008. + +-------------------------------------------------------------------------------- +Other names and brands may be claimed as the property of others. 
+ -------------------------------------------------------------------------------- \ No newline at end of file diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelModin_Vs_Pandas/IntelModin_Vs_Pandas.ipynb b/AI-and-Analytics/Getting-Started-Samples/Modin_Vs_Pandas/Modin_Vs_Pandas.ipynb similarity index 100% rename from AI-and-Analytics/Getting-Started-Samples/IntelModin_Vs_Pandas/IntelModin_Vs_Pandas.ipynb rename to AI-and-Analytics/Getting-Started-Samples/Modin_Vs_Pandas/Modin_Vs_Pandas.ipynb diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelModin_Vs_Pandas/README.md b/AI-and-Analytics/Getting-Started-Samples/Modin_Vs_Pandas/README.md similarity index 58% rename from AI-and-Analytics/Getting-Started-Samples/IntelModin_Vs_Pandas/README.md rename to AI-and-Analytics/Getting-Started-Samples/Modin_Vs_Pandas/README.md index cd1a77c7bb..14f3f0f442 100644 --- a/AI-and-Analytics/Getting-Started-Samples/IntelModin_Vs_Pandas/README.md +++ b/AI-and-Analytics/Getting-Started-Samples/Modin_Vs_Pandas/README.md @@ -1,16 +1,16 @@ -# `Intel® Modin* Vs. Pandas Performance` Sample +# Modin* Vs. Pandas Performance Sample -The `Intel® Modin* Vs. Pandas Performance` code illustrates how to use Modin* to replace the Pandas API. The sample compares the performance of Intel® Distribution of Modin* and the performance of Pandas for specific dataframe operations. +The `Modin* Vs. Pandas Performance` code illustrates how to use Modin* to replace the Pandas API. The sample compares the performance of Modin* and the performance of Pandas for specific dataframe operations. | Area | Description |:--- |:--- -| What you will learn | How to accelerate the Pandas API using Intel® Distribution of Modin*. -| Time to complete | Less than 10 minutes | Category | Concepts and Functionality +| What you will learn | How to accelerate the Pandas API using Modin*. 
+| Time to complete | Less than 10 minutes ## Purpose -Intel® Distribution of Modin* accelerates Pandas operations using Ray or Dask execution engine. The distribution provides compatibility and integration with the existing Pandas code. The sample code demonstrates how to perform some basic dataframe operations using Pandas and Intel® Distribution of Modin*. You will be able to compare the performance difference between the two methods. +Modin* accelerates Pandas operations using the Ray or Dask execution engine. It provides compatibility and integration with existing Pandas code. The sample code demonstrates how to perform some basic dataframe operations using Pandas and Modin*. You will be able to compare the performance difference between the two methods. You can run the sample locally or in Google Colaboratory (Colab). ## Prerequisites @@ -19,31 +19,31 @@ You can run the sample locally or in Google Colaboratory (Colab). |:--- |:--- | OS | Ubuntu* 20.04 (or newer) | Hardware | Intel® Core™ Gen10 Processor
Intel® Xeon® Scalable Performance processors -| Software | Intel® Distribution of Modin* +| Software | Modin* ## Key Implementation Details -This code sample is implemented for CPU using Python programming language. The sample requires NumPy, Pandas, Modin libraries, and the time module in Python. +This code sample is implemented for CPU using the Python programming language. The sample requires the NumPy, Pandas, and Modin* libraries, and the time module in Python. -## Run the `Intel® Modin Vs Pandas Performance` Sample Locally +## Environment Setup -If you want to run the sample on a local system using a command-line interface (CLI), you must install the Intel® Distribution of Modin* in a new Conda* environment first. +If you want to run the sample on a local system using a command-line interface (CLI), you must install Modin* in a new Conda* environment first. -### Install the Intel® Distribution of Modin* +### Install Modin* 1. Create a Conda environment. ``` - conda create --name aikit-modin + conda create --name modin ``` 2. Activate the Conda environment. ``` - source activate aikit-modin + source activate modin ``` 3. Remove existing versions of Modin* (if any exist). ``` conda remove modin --y ``` -4. Install Intel® Distribution of Modin* (v0.12.1 or newer). +4. Install Modin* (v0.12.1 or newer). ``` pip install modin[all]==0.12.1 ``` @@ -58,16 +58,16 @@ If you want to run the sample on a local system using a command-line interface ( ``` ### Run the Sample -1. Change to the directory containing the `IntelModin_Vs_Pandas.ipynb` notebook file on your local system. +1. Change to the directory containing the `Modin_Vs_Pandas.ipynb` notebook file on your local system. 2. Run the sample notebook. ``` - ipython IntelModin_Vs_Pandas.ipynb + ipython Modin_Vs_Pandas.ipynb ``` -## Run the `Intel® Modin Vs Pandas Performance` Sample in Google Colaboratory +## Run the `Modin* Vs Pandas Performance` Sample in Google Colaboratory -1. 
Change to the directory containing the `IntelModin_Vs_Pandas.ipynb` notebook file on your local system. +1. Change to the directory containing the `Modin_Vs_Pandas.ipynb` notebook file on your local system. 2. Open the notebook file, and remove the prepended number sign (#) symbol from the following lines: ``` @@ -75,7 +75,7 @@ If you want to run the sample on a local system using a command-line interface ( #!pip install numpy #!pip install pandas ``` - These changes will install the Intel® Distribution of Modin* and the NumPy and Pandas libraries when run in the Colab notebook. + These changes will install Modin* and the NumPy and Pandas libraries when run in the Colab notebook. 3. Save your changes. @@ -100,7 +100,11 @@ CPU times: user 8.47 s, sys: 132 ms, total: 8.6 s Wall time: 8.57 s ``` -Example expected cell output is included in `IntelModin_Vs_Pandas.ipynb`. +Example expected cell output is included in `Modin_Vs_Pandas.ipynb`. + +## Related Samples + +* [Modin Get Started Sample](https://github.com/oneapi-src/oneAPI-samples/tree/master/AI-and-Analytics/Getting-Started-Samples/Modin_GettingStarted) ## License @@ -108,3 +112,5 @@ Code samples are licensed under the MIT license. See [License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt) for details. Third party program licenses are at [third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt). + +*Other names and brands may be claimed as the property of others. 
[Trademarks](https://www.intel.com/content/www/us/en/legal/trademarks.html) diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelModin_Vs_Pandas/sample.json b/AI-and-Analytics/Getting-Started-Samples/Modin_Vs_Pandas/sample.json similarity index 74% rename from AI-and-Analytics/Getting-Started-Samples/IntelModin_Vs_Pandas/sample.json rename to AI-and-Analytics/Getting-Started-Samples/Modin_Vs_Pandas/sample.json index aef64a246f..bab9d6980f 100644 --- a/AI-and-Analytics/Getting-Started-Samples/IntelModin_Vs_Pandas/sample.json +++ b/AI-and-Analytics/Getting-Started-Samples/Modin_Vs_Pandas/sample.json @@ -1,8 +1,8 @@ { "guid": "FE479C5C-C7A0-4612-B8D0-F83D07155411", - "name": "Intel® Modin Vs. Pandas Performance", + "name": "Modin* Vs. Pandas Performance", "categories": ["Toolkit/oneAPI AI And Analytics/Getting Started"], - "description": "This sample code illustrates how Intel® Modin accelerates the performance of Pandas for computational operations on a dataframe.", + "description": "This sample code illustrates how Modin* accelerates the performance of Pandas for computational operations on a dataframe.", "builder": ["cli"], "languages": [{ "python": {} @@ -21,9 +21,9 @@ "pip install numpy", "pip install pandas", "pip install ipython # To run colab notebook", - "ipython IntelModin_Vs_Pandas.ipynb # Execute the notebook" + "ipython Modin_Vs_Pandas.ipynb # Execute the notebook" ] }] }, "expertise": "Concepts and Functionality" -} \ No newline at end of file +} diff --git a/AI-and-Analytics/Getting-Started-Samples/README.md b/AI-and-Analytics/Getting-Started-Samples/README.md index ff98dbaf2d..a8d82bd7da 100644 --- a/AI-and-Analytics/Getting-Started-Samples/README.md +++ b/AI-and-Analytics/Getting-Started-Samples/README.md @@ -18,9 +18,8 @@ Third party program Licenses can be found here: [third-party-programs.txt](https |--------------------------| --------- | ------------------------------------------------ | - |Inference Optimization| Intel® Neural Compressor 
(INC) | [Intel® Neural Compressor (INC) Sample-for-PyTorch](INC-Quantization-Sample-for-PyTorch) | Performs INT8 quantization on a Hugging Face BERT model. |Inference Optimization| Intel® Neural Compressor (INC) | [Intel® Neural Compressor (INC) Sample-for-Tensorflow](INC-Sample-for-Tensorflow) | Quantizes a FP32 model into INT8 by Intel® Neural Compressor (INC) and compares the performance between FP32 and INT8. -|Data Analytics
Classical Machine Learning
Deep Learning
Inference Optimization | oneAPI docker image | [IntelAIKitContainer_GettingStarted](IntelAIKitContainer_GettingStarted) | Configuration script to automatically configure the environment. -|Data Analytics
Classical Machine Learning | Modin | [IntelModin_GettingStarted](IntelModin_GettingStarted) | Run Modin-accelerated Pandas functions and note the performance gain. -|Data Analytics
Classical Machine Learning | Modin |[IntelModin_Vs_Pandas](IntelModin_Vs_Pandas)| Compares the performance of Intel® Distribution of Modin* and the performance of Pandas. +|Data Analytics
Classical Machine Learning | Modin* | [Modin_GettingStarted](Modin_GettingStarted) | Run Modin*-accelerated Pandas functions and note the performance gain. +|Data Analytics
Classical Machine Learning | Modin* |[Modin_Vs_Pandas](Modin_Vs_Pandas)| Compares the performance of Modin* and the performance of Pandas. |Classical Machine Learning| Intel® Optimization for XGBoost* | [IntelPython_XGBoost_GettingStarted](IntelPython_XGBoost_GettingStarted) | Set up and trains an XGBoost* model on datasets for prediction. |Classical Machine Learning| daal4py | [IntelPython_daal4py_GettingStarted](IntelPython_daal4py_GettingStarted) | Batch linear regression using the Python API package daal4py from oneAPI Data Analytics Library (oneDAL). |Deep Learning
Inference Optimization| Intel® Optimization for TensorFlow* | [IntelTensorFlow_GettingStarted](IntelTensorFlow_GettingStarted) | A simple training example for TensorFlow. diff --git a/AI-and-Analytics/Jupyter/Numba_dpex_Essentials_training/03_dpex_Pairwise_Distance/lab/pair_wise_kernel.py b/AI-and-Analytics/Jupyter/Numba_dpex_Essentials_training/03_dpex_Pairwise_Distance/lab/pair_wise_kernel.py index 5f35c3a6a0..f2cb40c2c7 100644 --- a/AI-and-Analytics/Jupyter/Numba_dpex_Essentials_training/03_dpex_Pairwise_Distance/lab/pair_wise_kernel.py +++ b/AI-and-Analytics/Jupyter/Numba_dpex_Essentials_training/03_dpex_Pairwise_Distance/lab/pair_wise_kernel.py @@ -22,7 +22,7 @@ def pairwise_python(X1, X2, D): def pw_distance(X1, X2, D): - pairwise_python[X1.shape[0],](X1, X2, D) + pairwise_python[nbdx.Range(X1.shape[0])](X1, X2, D) base_pair_wise_gpu.run("Pairwise Distance Kernel", pw_distance) diff --git a/AI-and-Analytics/Jupyter/Numba_dpex_Essentials_training/05_dpex_Kmeans/lab/kmeans_kernel_atomic.py b/AI-and-Analytics/Jupyter/Numba_dpex_Essentials_training/05_dpex_Kmeans/lab/kmeans_kernel_atomic.py index 175e37cbdd..0c2778fb12 100644 --- a/AI-and-Analytics/Jupyter/Numba_dpex_Essentials_training/05_dpex_Kmeans/lab/kmeans_kernel_atomic.py +++ b/AI-and-Analytics/Jupyter/Numba_dpex_Essentials_training/05_dpex_Kmeans/lab/kmeans_kernel_atomic.py @@ -60,20 +60,20 @@ def kmeans_kernel( num_points, num_centroids, ): - copy_arrayC[num_centroids,](arrayC, arrayP) + copy_arrayC[nbdx.Range(num_centroids)](arrayC, arrayP) for i in range(niters): - groupByCluster[num_points,]( + groupByCluster[nbdx.Range(num_points)]( arrayP, arrayPcluster, arrayC, num_points, num_centroids ) - calCentroidsSum1[num_centroids,](arrayCsum, arrayCnumpoint) + calCentroidsSum1[nbdx.Range(num_centroids)](arrayCsum, arrayCnumpoint) - calCentroidsSum2[num_points,]( + calCentroidsSum2[nbdx.Range(num_points)]( arrayP, arrayPcluster, arrayCsum, arrayCnumpoint ) - updateCentroids[num_centroids,]( + 
updateCentroids[nbdx.Range(num_centroids)]( arrayC, arrayCsum, arrayCnumpoint, num_centroids ) diff --git a/DirectProgramming/C++SYCL/DenseLinearAlgebra/matrix_mul/README.md b/DirectProgramming/C++SYCL/DenseLinearAlgebra/matrix_mul/README.md index 204d336dfd..eab6bb10e0 100644 --- a/DirectProgramming/C++SYCL/DenseLinearAlgebra/matrix_mul/README.md +++ b/DirectProgramming/C++SYCL/DenseLinearAlgebra/matrix_mul/README.md @@ -134,6 +134,17 @@ P = m_size / 2; ``` > **Note**: The size value must be in multiples of **8**. +## Example Output +``` +./matrix_mul_dpc + +Device: Intel(R) Iris(R) Xe Graphics + +Problem size: c(150,600) = a(150,300) * b(300,600) + +Result of matrix multiplication using SYCL: Success - The results are correct! +``` + ## License Code samples are licensed under the MIT license. See [License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt) for details. diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/CMakeLists.txt b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/CMakeLists.txt new file mode 100644 index 0000000000..e0bded3dae --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/CMakeLists.txt @@ -0,0 +1,4 @@ +cmake_minimum_required (VERSION 3.4) +set(CMAKE_CXX_COMPILER "icpx") +project (Iso3DFD) +add_subdirectory (src) \ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/img/prop2.png b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/img/prop2.png new file mode 100644 index 0000000000..5cacb2ff4c Binary files /dev/null and b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/img/prop2.png differ diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/img/prop3.png b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/img/prop3.png new file mode 100644 index 0000000000..364ad78531 
Binary files /dev/null and b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/img/prop3.png differ diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/img/prop4.png b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/img/prop4.png new file mode 100644 index 0000000000..70f4e63e65 Binary files /dev/null and b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/img/prop4.png differ diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/img/prop5.png b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/img/prop5.png new file mode 100644 index 0000000000..7328611cdb Binary files /dev/null and b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/img/prop5.png differ diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/img/r1.png b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/img/r1.png new file mode 100644 index 0000000000..2e4b88e186 Binary files /dev/null and b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/img/r1.png differ diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/img/stencil_mount.png b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/img/stencil_mount.png new file mode 100644 index 0000000000..15e2692eb5 Binary files /dev/null and b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/img/stencil_mount.png differ diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/img/workgroup.png b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/img/workgroup.png new file mode 100644 index 0000000000..1c4cb03c15 Binary files /dev/null and b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/img/workgroup.png differ diff --git 
a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/iso3dfd_Offload_Advisor_Analysis.ipynb b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/iso3dfd_Offload_Advisor_Analysis.ipynb new file mode 100644 index 0000000000..87cbfcb72b --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/iso3dfd_Offload_Advisor_Analysis.ipynb @@ -0,0 +1,642 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# ISO3DFD and Offload Advisor Analysis" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Learning Objectives" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "
- To run Offload Advisor and generate an HTML report\n", + "
- To read and understand the metrics in the report\n", + "
- To get a performance estimation of your application on the target hardware\n", + "
- To decide which loops are good candidates for offload\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## ISO3DFD Application basics" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this module, initially we will assume that the developer already has a code running on a CPU. At this stage, it doesn't matter if the code is written in C/C++ or Fortran. Before porting a code on a GPU, the developer should try to understand which parts of the code should be offloaded on the GPU. This step is not always trivial because the developer needs to understand the code but also the hardware that will be used for offloading the computations.\n", + "The goal of this activity is to show how Intel® Advisor can help deciding what part of the code should or should not be offloaded on the GPU. At the end of this activity, you will be able:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Iso3DFD is a wave propagation kernel used in Oil and Gas applications. The resolution of the wave equation is based on finite differences which results in implementing a stencil in a 3D volume.\n", + "\n", + "![3D Stencil](img/stencil_mount.png)\n", + "\n", + "The general algorithm can be described as follow, using next and prev to store the pressure and vel to store velocity:
\n", + "\n", + "iterate over time steps
\n", + "|  iterate over Z
\n", + "|  |  iterate over Y
\n", + "|  |  |  iterate over X
\n", + "|  |  |  |  tmp = compute stencil for prev[x,y,z]
\n", + "|  |  |  |  next[x,y,z] = update(prev[x,y,z], next[x,y,z], vel[x,y,z])
\n", + "|  swap(prev, next)
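The pseudocode above maps directly onto array operations. The following is a NumPy sketch of one wavefield update, for illustration only — the sample's real implementation is the C++ `iso3dfdIteration` kernel shown later in this notebook, and the function and array names here are ours:

```python
import numpy as np

def iso3dfd_step(prev, nxt, vel, coeff):
    # One time step of the stencil above:
    # nxt[x,y,z] = 2*prev[x,y,z] - nxt[x,y,z] + (stencil over prev) * vel[x,y,z]
    k = len(coeff) - 1  # stencil half-length (kHalfLength in the C++ sample)
    n3, n2, n1 = prev.shape
    value = coeff[0] * prev[k:n3-k, k:n2-k, k:n1-k]
    for i in range(1, k + 1):
        # Neighbors at distance i along X, Y and Z, weighted by coeff[i]
        value = value + coeff[i] * (
            prev[k:n3-k, k:n2-k, k+i:n1-k+i] + prev[k:n3-k, k:n2-k, k-i:n1-k-i]
            + prev[k:n3-k, k+i:n2-k+i, k:n1-k] + prev[k:n3-k, k-i:n2-k-i, k:n1-k]
            + prev[k+i:n3-k+i, k:n2-k, k:n1-k] + prev[k-i:n3-k-i, k:n2-k, k:n1-k]
        )
    nxt[k:n3-k, k:n2-k, k:n1-k] = (
        2.0 * prev[k:n3-k, k:n2-k, k:n1-k]
        - nxt[k:n3-k, k:n2-k, k:n1-k]
        + value * vel[k:n3-k, k:n2-k, k:n1-k]
    )

def iso3dfd(prev, nxt, vel, coeff, nreps):
    # Iterate over time steps, swapping prev/nxt so prev always holds the
    # current wavefield (mirrors the std::swap in the C++ driver)
    for _ in range(nreps):
        iso3dfd_step(prev, nxt, vel, coeff)
        prev, nxt = nxt, prev
```

Here `coeff[0]` weighs the center point and `coeff[i]` the six neighbors at distance `i` along each axis; the halo of width `k` around the volume is never written, just as in the C++ kernel.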
\n", + "\n", + "If we try to extract a 2D cut of the volume at different time steps, we can see a perturbation evolving and reflecting on the edges.\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Wave propagation snapshots at T10, T20, T30, and T40 (see img/prop2.png through img/prop5.png)
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Compiling and running iso3DFD " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The first step will be to compile and run for the first time this application. Below is the step by step guide that shows how to optimize iso3dfd. We'll start with code that runs on the CPU, then a basic implementation of GPU offload, then make several iterations to optimize the code. The below uses the Intel® Advisor analysis tool to provide performance analysis of the built applications.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Offloading modeling\n", + "The first step is to run offload modeling on the CPU only version of the application (1_CPU_only) to identify code regions that are good opportunities for GPU offload. Running accurate modeling can take considerable time as Intel® Advisor performs analysis on your project. There are two commands provided below. The first is fast, but less accurate and should only be used as a proof of concept. The second will give considerably more helpful and accurate profile information. Depending on your system, modeling may take well over an hour.\n", + "\n", + "The SYCL code below shows CPU code: Inspect code, there are no modifications necessary:\n", + "1. Inspect the code cell below and click run ▶ to save the code to file\n", + "2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%writefile src/1_CPU_only.cpp\n", + "//==============================================================\n", + "// Copyright 2022 Intel Corporation\n", + "//\n", + "// SPDX-License-Identifier: MIT\n", + "// =============================================================\n", + "\n", + "#include \n", + "#include \n", + "#include \n", + "\n", + "#include \"Utils.hpp\"\n", + "\n", + "void inline iso3dfdIteration(float* ptr_next_base, float* ptr_prev_base,\n", + " float* ptr_vel_base, float* coeff, const size_t n1,\n", + " const size_t n2, const size_t n3) {\n", + " auto dimn1n2 = n1 * n2;\n", + "\n", + " // Remove HALO from the end\n", + " auto n3_end = n3 - kHalfLength;\n", + " auto n2_end = n2 - kHalfLength;\n", + " auto n1_end = n1 - kHalfLength;\n", + "\n", + " for (auto iz = kHalfLength; iz < n3_end; iz++) {\n", + " for (auto iy = kHalfLength; iy < n2_end; iy++) {\n", + " // Calculate start pointers for the row over X dimension\n", + " float* ptr_next = ptr_next_base + iz * dimn1n2 + iy * n1;\n", + " float* ptr_prev = ptr_prev_base + iz * dimn1n2 + iy * n1;\n", + " float* ptr_vel = ptr_vel_base + iz * dimn1n2 + iy * n1;\n", + "\n", + " // Iterate over X\n", + " for (auto ix = kHalfLength; ix < n1_end; ix++) {\n", + " // Calculate values for each cell\n", + " float value = ptr_prev[ix] * coeff[0];\n", + " for (int i = 1; i <= kHalfLength; i++) {\n", + " value +=\n", + " coeff[i] *\n", + " (ptr_prev[ix + i] + ptr_prev[ix - i] +\n", + " ptr_prev[ix + i * n1] + ptr_prev[ix - i * n1] +\n", + " ptr_prev[ix + i * dimn1n2] + ptr_prev[ix - i * dimn1n2]);\n", + " }\n", + " ptr_next[ix] = 2.0f * ptr_prev[ix] - ptr_next[ix] + value * ptr_vel[ix];\n", + " }\n", + " }\n", + " }\n", + "}\n", + "\n", + "void iso3dfd(float* next, float* prev, float* vel, float* coeff,\n", + " const size_t n1, const size_t n2, const size_t n3,\n", + " const size_t nreps) {\n", + " for 
(auto it = 0; it < nreps; it++) {\n", + " iso3dfdIteration(next, prev, vel, coeff, n1, n2, n3);\n", + " // Swap the pointers for always having current values in prev array\n", + " std::swap(next, prev);\n", + " }\n", + "}\n", + "\n", + "int main(int argc, char* argv[]) {\n", + " // Arrays used to update the wavefield\n", + " float* prev;\n", + " float* next;\n", + " // Array to store wave velocity\n", + " float* vel;\n", + "\n", + " // Variables to store size of grids and number of simulation iterations\n", + " size_t n1, n2, n3;\n", + " size_t num_iterations;\n", + "\n", + " if (argc < 5) {\n", + " Usage(argv[0]);\n", + " return 1;\n", + " }\n", + "\n", + " try {\n", + " // Parse command line arguments and increase them by HALO\n", + " n1 = std::stoi(argv[1]) + (2 * kHalfLength);\n", + " n2 = std::stoi(argv[2]) + (2 * kHalfLength);\n", + " n3 = std::stoi(argv[3]) + (2 * kHalfLength);\n", + " num_iterations = std::stoi(argv[4]);\n", + " } catch (...) {\n", + " Usage(argv[0]);\n", + " return 1;\n", + " }\n", + "\n", + " // Validate input sizes for the grid\n", + " if (ValidateInput(n1, n2, n3, num_iterations)) {\n", + " Usage(argv[0]);\n", + " return 1;\n", + " }\n", + "\n", + " // Compute the total size of grid\n", + " size_t nsize = n1 * n2 * n3;\n", + "\n", + " prev = new float[nsize];\n", + " next = new float[nsize];\n", + " vel = new float[nsize];\n", + "\n", + " // Compute coefficients to be used in wavefield update\n", + " float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1,\n", + " +7.572087e-2, -1.76767677e-2, +3.480962e-3,\n", + " -5.180005e-4, +5.074287e-5, -2.42812e-6};\n", + "\n", + " // Apply the DX, DY and DZ to coefficients\n", + " coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz);\n", + " for (auto i = 1; i <= kHalfLength; i++) {\n", + " coeff[i] = coeff[i] / (dxyz * dxyz);\n", + " }\n", + "\n", + " // Initialize arrays and introduce initial conditions (source)\n", + " initialize(prev, next, vel, n1, n2, n3);\n", + "\n", + " std::cout 
<< \"Running on CPU serial version\\n\";\n", + " auto start = std::chrono::steady_clock::now();\n", + "\n", + " // Invoke the driver function to perform 3D wave propagation 1 thread serial\n", + " // version\n", + " iso3dfd(next, prev, vel, coeff, n1, n2, n3, num_iterations);\n", + "\n", + " auto end = std::chrono::steady_clock::now();\n", + " auto time = std::chrono::duration_cast(end - start)\n", + " .count();\n", + "\n", + " printStats(time, n1, n2, n3, num_iterations);\n", + "\n", + " delete[] prev;\n", + " delete[] next;\n", + " delete[] vel;\n", + "\n", + " return 0;\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once the application is created, we can run it from the command line by using few parameters as following:\n", + "src/1_CPU_only 256 256 256 100\n", + "
- src/1_CPU_only is the binary\n", + "
- 256 256 256 are the sizes of the 3 dimensions; increasing them results in more computation time\n", + "
- 100 is the number of time steps\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Build and Run\n", + "Select the cell below and click run ▶ to compile and execute the code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "! chmod 755 q; chmod 755 run_cpu_only.sh;if [ -x \"$(command -v qsub)\" ]; then ./q run_cpu_only.sh; else ./run_cpu_only.sh; fi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that you have been able to compile and execute the code, let's start profiling what should be offloaded !" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Running Offload Advisor" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The current code is running on a CPU and is actually not even threaded. For Intel® Offload Advisor, it doesn't matter if your code is already threaded. Advisor will run several analyses on your application to extract several metric such as the number of operations, the number of memory transfers, data dependencies and many more.\n", + "We are going to detail each of these steps. Remember that our goal here is to decide if some of our loops are good candidates for offload. In this section, we will generate the report assuming that we want to offload our computations on a GPU on Intel Devcloud.\n", + "Keep in mind that if you want Advisor to extract as much information as possible, you need to compile your application with debug information (-g with intel compilers).\n", + "\n", + "The first step is to run offload modeling on the CPU only version of the application (1_CPU_only) to identify code regions that are good opportunities for GPU offload. Running accurate modeling can take considerable time as Intel® Advisor performs analysis on your project. There are two commands provided below. The first is fast, but less accurate and should only be used as a proof of concept. 
The second will give considerably more helpful and accurate profile information. Depending on your system, modeling may take well over an hour.\n", + "\n", + "Run one of the following from the \"build\" directory:\n", + "```\n", + "advisor --collect=offload --config=pvc_xt_448xve --project-dir=./../advisor/1_cpu -- ./build/src/1_CPU_only 256 256 256 20\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "### Simple method: Use Collection Presets\n", + "For the Offload Modeling perspective, Intel Advisor has a special collection mode --collect=offload that allows you to run several analyses using only one Intel Advisor CLI command. When you run the collection, it sequentially runs data collection and performance modeling steps.\n", + "In the commands below, make sure to replace myApplication with your application executable path and name before executing a command. If your application requires additional command line options, add them after the executable name.\n", + "```\n", + "advisor --collect=offload --project-dir=./advi_results -- ./myApplication \n", + "```\n", + "The iso3DFD CPU code can be run using:\n", + "```\n", + "advisor --collect=offload --config=pvc_xt_448xve --project-dir=./../advisor/1_cpu -- ./build/src/1_CPU_only 256 256 256 20\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Build and Run\n", + "Select the cell below and click run ▶ to compile and execute the code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!
chmod 755 q; chmod 755 run_offload_advisor.sh;if [ -x \"$(command -v qsub)\" ]; then ./q run_offload_advisor.sh; else ./run_offload_advisor.sh; fi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Second Method to run the Offload Advisor\n", + "\n", + "### Running the Survey" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The Survey is usually the first analysis you want to run with Intel® Advisor. The survey is mainly used to time your application as well as the different loops and functions. There is a minimal performance penalty at this stage. This analysis is also used to extract information embedded by the compiler in your binary. These information are mainly related to vectorization (why or why not vectorization, vectorization efficiency, etc).\n", + "\n", + "```\n", + "advisor --collect=survey --auto-finalize --static-instruction-mix -- ./build/src/1_CPU_only 128 128 128 20\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Build and Run\n", + "Select the cell below and click run ▶ to compile and execute the code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "! chmod 755 q; chmod 755 run_advisor_survey.sh;if [ -x \"$(command -v qsub)\" ]; then ./q run_advisor_survey.sh; else ./run_advisor_survey.sh; fi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Running the trip count and cache simulation \n", + "The second step to decide what should be offloaded, will be to run the trip count analysis as well as the cache simulation. This second step uses instrumentation to count how many iterations you are running in each loops. 
Adding the option -flop will also provide the precise number of operations executed in each of your code sections.\n", + "\n", + "In this step, we also ask advisor to run a cache simulation, specifying the memory configuration of the hardware we are targeting for offload\n", + "\n", + "Be aware that this step will take much more time than simply running your application. You can expect something like a 10x speed-down due to the many parameters Advisor tries to extract during the run.\n", + "```\n", + "advisor --collect=tripcounts --flop --auto-finalize --target-device=gen9_gt2 -- ./build/src/1_CPU_only 128 128 128 20\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Build and Run\n", + "Select the cell below and click run ▶ to compile and execute the code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "! chmod 755 q; chmod 755 run_advisor_tripcounts.sh;if [ -x \"$(command -v qsub)\" ]; then ./q run_advisor_tripcounts.sh; else ./run_advisor_tripcounts.sh; fi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Optional: Dependency analysis" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Forcing threading in location where it is not supposed to happen might be quite dangerous and result in computation changes. In order to avoid parallelizing loops that cannot be parallelized, it is possible to run an additional analysis called the dependency analysis. 
This step was initially used to help users implementing vectorization but Offload Advisor can also use it to recommend what can be offloaded or not.\n", + "\n", + "```\n", + "advisor -collect=dependencies --loop-call-count-limit=16 --select markup=gpu_generic --filter-reductions --project-dir=./advi_results -- ./myApplication\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Build and Run\n", + "Select the cell below and click run ▶ to compile and execute the code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "! chmod 755 q; chmod 755 run_cpu_only.sh;if [ -x \"$(command -v qsub)\" ]; then ./q run_cpu_only.sh; else ./run_cpu_only.sh; fi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Analyzing the HTML report" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We finally reached the last step and only need to generate our HTML report for offloading on GPU. This report will show us:\n", + "
- What is the expected speedup on the target device\n", + "
- What will most likely be our bottleneck on the target device\n", + "
- What are the good candidates for offload\n", + "
- What are the loops that should not be offloaded\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "### Advisor report overview\n", + "To display the report, just execute the following frame. In practice, the report will be available in the folder you defined as --out-dir in the previous script.\n", + "\n", + "[View the report in HTML](reports/advisor_report_overview.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import IFrame\n", + "\n", + "IFrame(src='reports/advisor-report.html', width=900, height=600)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "


\n", + "\n", + "[Tell us how we did in this module with a short survey. We will use your feedback to improve the quality and impact of these learning materials. Thanks!](https://intel.az1.qualtrics.com/jfe/form/SV_6m4G7BXPNSS7FBz)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "from IPython.display import IFrame\n", + "\n", + "IFrame(src='reports/advisor_report_overview.html', width=900, height=600)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "### Advisor report\n", + "To display the report, just execute the following frame. In practice, the report will be available in the folder you defined as --out-dir in the previous script. \n", + "\n", + "[View the report in HTML](reports/report.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import IFrame\n", + "\n", + "IFrame(src='reports/report.html', width=900, height=600)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Navigate in the report and try to understand what should be the speedup, what should be offloaded and what should not be offloaded. Navigate also to the \"Offloaded Regions\" tab to see exactly which part of the code should run on the GPU." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### How to remember these complex command lines ? " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You might think that the command lines we used are too complex to be remembered and you are right ! 
This is the reason why Advisor provides an option called --dry-run that will give you all the independent commands you need to use to run this analysis from scratch.\n", + "\n", + "Generate pre-configured command lines with --collect=offload and the --dry-run option.\n", + "The option generates:\n", + "* Commands for the Intel Advisor CLI collection workflow\n", + "* Commands that correspond to a specified accuracy level\n", + "\n", + "```\n", + "advisor --collect=offload --accuracy=low --dry-run --project-dir=./advi_results -- ./myApplication\n", + "```\n", + "\n", + "```\n", + "advisor --collect=offload --accuracy=low --dry-run -- ./build/src/1_CPU_only 128 128 128 20\n", + "```\n", + "--config can use the following devices:\n", + "
- pvc_xt_448xve\n", + "
- xehpg_512xve\n", + "
- xehpg_256xve\n", + "
- gen12_tgl\n", + "
- gen12_dg1\n", + "
- gen11_icl\n", + "
- gen11_gt2\n", + "
- gen9_gt2\n", + "
- gen9_gt3\n", + "
- gen9_gt4\n", + "
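For scripting, the Offload Modeling invocation can be assembled from these pieces. This is a hypothetical convenience helper, not part of the sample; it uses only the advisor flags already shown in this notebook:

```python
import shlex

def advisor_offload_cmd(app, app_args, config="pvc_xt_448xve",
                        project_dir="./advi_results"):
    # Assemble the Offload Modeling command line for one of the --config
    # devices listed above; the target binary and its args follow the "--"
    cmd = (f"advisor --collect=offload --config={config} "
           f"--project-dir={project_dir} -- {app} {' '.join(app_args)}")
    return shlex.split(cmd)

print(advisor_offload_cmd("./build/src/1_CPU_only", ["256", "256", "256", "20"]))
```

Swapping the `config` argument for, say, `gen9_gt2` models the same run against a different target device without retyping the whole command.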
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "! chmod 755 q; chmod 755 run_dry_run_advisor; if [ -x \"$(command -v qsub)\" ]; then ./q run_dry_run_advisor.sh; else ./run_dry_run_advisor.sh; fi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary\n", + "### Next Iteration of implemeting the parallelism using SYCL\n", + "In this module\n", + "\n", + "* Started with serial C++ code that runs on the CPU. \n", + "* Used the Intel® Advisor analysis tool to provide performance analysis/projections of the application.\n", + "* Ran offload modeling on the CPU version of the application to identify code regions that are good opportunities for GPU offload.\n", + "* Reviewed the Offload report and we are ready to build an implementation of GPU offload using SYCL\n", + "* We will also make several iterations of the SYCL code to optimize the code for GPUs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "state": {}, + "version_major": 2, + "version_minor": 0 + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/reports/advisor-report.html b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/reports/advisor-report.html new file mode 100644 index 0000000000..f33f29c724 --- /dev/null +++ 
b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/reports/advisor-report.html @@ -0,0 +1,2 @@ +Intel Advisor Report
\ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/reports/report.html b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/reports/report.html new file mode 100644 index 0000000000..893a438e4b --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/reports/report.html @@ -0,0 +1,16 @@ +Offload report by Intel Advisor
Please wait, loading the report...
\ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/run_advisor_dependency.sh b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/run_advisor_dependency.sh new file mode 100644 index 0000000000..3be2b92b4d --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/run_advisor_dependency.sh @@ -0,0 +1,4 @@ +#!/bin/bash +advisor --collect=offload --accuracy=low --dry-run -- ./build/src/1_CPU_only 128 128 128 20 + + diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/run_advisor_survey.sh b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/run_advisor_survey.sh new file mode 100644 index 0000000000..1d76367553 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/run_advisor_survey.sh @@ -0,0 +1,4 @@ +#!/bin/bash +advisor --collect=survey --auto-finalize --static-instruction-mix -- ./build/src/1_CPU_only 128 128 128 20 + + diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/run_advisor_tripcounts.sh b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/run_advisor_tripcounts.sh new file mode 100644 index 0000000000..752c981637 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/run_advisor_tripcounts.sh @@ -0,0 +1,4 @@ +#!/bin/bash +advisor --collect=tripcounts --flop --no-auto-finalize --target-device=pvc_xt_448xve -- ./build/src/1_CPU_only 128 128 128 20 + + diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/run_cpu_only.sh b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/run_cpu_only.sh new file mode 100644 index 0000000000..179a49c15f --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/run_cpu_only.sh @@ -0,0 +1,10 @@ +#!/bin/bash + +rm -rf build +build="$PWD/build" +[ ! 
-d "$build" ] && mkdir -p "$build" +cd build && +cmake .. && +make run_cpu + + diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/run_dry_run_advisor.sh b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/run_dry_run_advisor.sh new file mode 100644 index 0000000000..3be2b92b4d --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/run_dry_run_advisor.sh @@ -0,0 +1,4 @@ +#!/bin/bash +advisor --collect=offload --accuracy=low --dry-run -- ./build/src/1_CPU_only 128 128 128 20 + + diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/run_offload_advisor.sh b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/run_offload_advisor.sh new file mode 100644 index 0000000000..955799c27e --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/run_offload_advisor.sh @@ -0,0 +1,4 @@ +#!/bin/bash +advisor --collect=offload --config=pvc_xt_448xve --project-dir=./../advisor/1_cpu -- ./build/src/1_CPU_only 256 256 256 20 + + diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/src/1_CPU_only.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/src/1_CPU_only.cpp new file mode 100644 index 0000000000..4465d53864 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/src/1_CPU_only.cpp @@ -0,0 +1,129 @@ +//============================================================== +// Copyright 2022 Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= + +#include +#include +#include + +#include "Utils.hpp" + +void inline iso3dfdIteration(float* ptr_next_base, float* ptr_prev_base, + float* ptr_vel_base, float* coeff, const size_t n1, + const size_t n2, const size_t n3) { + auto dimn1n2 = n1 * n2; + + // Remove HALO from the end + auto n3_end = n3 - kHalfLength; + auto 
n2_end = n2 - kHalfLength; + auto n1_end = n1 - kHalfLength; + + for (auto iz = kHalfLength; iz < n3_end; iz++) { + for (auto iy = kHalfLength; iy < n2_end; iy++) { + // Calculate start pointers for the row over X dimension + float* ptr_next = ptr_next_base + iz * dimn1n2 + iy * n1; + float* ptr_prev = ptr_prev_base + iz * dimn1n2 + iy * n1; + float* ptr_vel = ptr_vel_base + iz * dimn1n2 + iy * n1; + + // Iterate over X + for (auto ix = kHalfLength; ix < n1_end; ix++) { + // Calculate values for each cell + float value = ptr_prev[ix] * coeff[0]; + for (int i = 1; i <= kHalfLength; i++) { + value += + coeff[i] * + (ptr_prev[ix + i] + ptr_prev[ix - i] + + ptr_prev[ix + i * n1] + ptr_prev[ix - i * n1] + + ptr_prev[ix + i * dimn1n2] + ptr_prev[ix - i * dimn1n2]); + } + ptr_next[ix] = 2.0f * ptr_prev[ix] - ptr_next[ix] + value * ptr_vel[ix]; + } + } + } +} + +void iso3dfd(float* next, float* prev, float* vel, float* coeff, + const size_t n1, const size_t n2, const size_t n3, + const size_t nreps) { + for (auto it = 0; it < nreps; it++) { + iso3dfdIteration(next, prev, vel, coeff, n1, n2, n3); + // Swap the pointers for always having current values in prev array + std::swap(next, prev); + } +} + +int main(int argc, char* argv[]) { + // Arrays used to update the wavefield + float* prev; + float* next; + // Array to store wave velocity + float* vel; + + // Variables to store size of grids and number of simulation iterations + size_t n1, n2, n3; + size_t num_iterations; + + if (argc < 5) { + Usage(argv[0]); + return 1; + } + + try { + // Parse command line arguments and increase them by HALO + n1 = std::stoi(argv[1]) + (2 * kHalfLength); + n2 = std::stoi(argv[2]) + (2 * kHalfLength); + n3 = std::stoi(argv[3]) + (2 * kHalfLength); + num_iterations = std::stoi(argv[4]); + } catch (...) 
{ + Usage(argv[0]); + return 1; + } + + // Validate input sizes for the grid + if (ValidateInput(n1, n2, n3, num_iterations)) { + Usage(argv[0]); + return 1; + } + + // Compute the total size of grid + size_t nsize = n1 * n2 * n3; + + prev = new float[nsize]; + next = new float[nsize]; + vel = new float[nsize]; + + // Compute coefficients to be used in wavefield update + float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1, + +7.572087e-2, -1.76767677e-2, +3.480962e-3, + -5.180005e-4, +5.074287e-5, -2.42812e-6}; + + // Apply the DX, DY and DZ to coefficients + coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz); + for (auto i = 1; i <= kHalfLength; i++) { + coeff[i] = coeff[i] / (dxyz * dxyz); + } + + // Initialize arrays and introduce initial conditions (source) + initialize(prev, next, vel, n1, n2, n3); + + std::cout << "Running on CPU serial version\n"; + auto start = std::chrono::steady_clock::now(); + + // Invoke the driver function to perform 3D wave propagation 1 thread serial + // version + iso3dfd(next, prev, vel, coeff, n1, n2, n3, num_iterations); + + auto end = std::chrono::steady_clock::now(); + auto time = std::chrono::duration_cast(end - start) + .count(); + + printStats(time, n1, n2, n3, num_iterations); + + delete[] prev; + delete[] next; + delete[] vel; + + return 0; +} diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/src/2_GPU_basic.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/src/2_GPU_basic.cpp new file mode 100644 index 0000000000..3571f98bfc --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/src/2_GPU_basic.cpp @@ -0,0 +1,153 @@ +//============================================================== +// Copyright 2022 Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= + +#include +#include +#include +#include + +#include "Utils.hpp" + +using namespace sycl; + +void 
iso3dfd(queue& q, float* next, float* prev, float* vel, float* coeff, + const size_t n1, const size_t n2, const size_t n3, + const size_t nreps) { + // Create 3D SYCL range for kernels which not include HALO + range<3> kernel_range(n1 - 2 * kHalfLength, n2 - 2 * kHalfLength, + n3 - 2 * kHalfLength); + // Create 3D SYCL range for buffers which include HALO + range<3> buffer_range(n1, n2, n3); + // Create buffers using SYCL class buffer + buffer next_buf(next, buffer_range); + buffer prev_buf(prev, buffer_range); + buffer vel_buf(vel, buffer_range); + buffer coeff_buf(coeff, range(kHalfLength + 1)); + + for (auto it = 0; it < nreps; it += 1) { + // Submit command group for execution + q.submit([&](handler& h) { + // Create accessors + accessor next_acc(next_buf, h); + accessor prev_acc(prev_buf, h); + accessor vel_acc(vel_buf, h, read_only); + accessor coeff_acc(coeff_buf, h, read_only); + + // Send a SYCL kernel(lambda) to the device for parallel execution + // Each kernel runs single cell + h.parallel_for(kernel_range, [=](id<3> idx) { + // Start of device code + // Add offsets to indices to exclude HALO + int i = idx[0] + kHalfLength; + int j = idx[1] + kHalfLength; + int k = idx[2] + kHalfLength; + + // Calculate values for each cell + float value = prev_acc[i][j][k] * coeff_acc[0]; +#pragma unroll(8) + for (int x = 1; x <= kHalfLength; x++) { + value += + coeff_acc[x] * (prev_acc[i][j][k + x] + prev_acc[i][j][k - x] + + prev_acc[i][j + x][k] + prev_acc[i][j - x][k] + + prev_acc[i + x][j][k] + prev_acc[i - x][j][k]); + } + next_acc[i][j][k] = 2.0f * prev_acc[i][j][k] - next_acc[i][j][k] + + value * vel_acc[i][j][k]; + // End of device code + }); + }); + + // Swap the buffers for always having current values in prev buffer + std::swap(next_buf, prev_buf); + } +} + +int main(int argc, char* argv[]) { + // Arrays used to update the wavefield + float* prev; + float* next; + // Array to store wave velocity + float* vel; + + // Variables to store size of grids and 
number of simulation iterations + size_t n1, n2, n3; + size_t num_iterations; + + // Flag to verify results with CPU version + bool verify = false; + + if (argc < 5) { + Usage(argv[0]); + return 1; + } + + try { + // Parse command line arguments and increase them by HALO + n1 = std::stoi(argv[1]) + (2 * kHalfLength); + n2 = std::stoi(argv[2]) + (2 * kHalfLength); + n3 = std::stoi(argv[3]) + (2 * kHalfLength); + num_iterations = std::stoi(argv[4]); + if (argc > 5) verify = true; + } catch (...) { + Usage(argv[0]); + return 1; + } + + // Validate input sizes for the grid + if (ValidateInput(n1, n2, n3, num_iterations)) { + Usage(argv[0]); + return 1; + } + + // Create queue and print target info with default selector and in order + // property + queue q(default_selector_v, {property::queue::in_order()}); + std::cout << " Running GPU basic offload version\n"; + printTargetInfo(q); + + // Compute the total size of grid + size_t nsize = n1 * n2 * n3; + + prev = new float[nsize]; + next = new float[nsize]; + vel = new float[nsize]; + + // Compute coefficients to be used in wavefield update + float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1, + +7.572087e-2, -1.76767677e-2, +3.480962e-3, + -5.180005e-4, +5.074287e-5, -2.42812e-6}; + + // Apply the DX, DY and DZ to coefficients + coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz); + for (auto i = 1; i <= kHalfLength; i++) { + coeff[i] = coeff[i] / (dxyz * dxyz); + } + + // Initialize arrays and introduce initial conditions (source) + initialize(prev, next, vel, n1, n2, n3); + + auto start = std::chrono::steady_clock::now(); + + // Invoke the driver function to perform 3D wave propagation offloaded to + // the device + iso3dfd(q, next, prev, vel, coeff, n1, n2, n3, num_iterations); + + auto end = std::chrono::steady_clock::now(); + auto time = std::chrono::duration_cast(end - start) + .count(); + printStats(time, n1, n2, n3, num_iterations); + + // Verify result with the CPU serial version + if (verify) { + 
VerifyResult(prev, next, vel, coeff, n1, n2, n3, num_iterations); + } + + delete[] prev; + delete[] next; + delete[] vel; + + return 0; +} \ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/src/3_GPU_linear.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/src/3_GPU_linear.cpp new file mode 100644 index 0000000000..553b38a47d --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/src/3_GPU_linear.cpp @@ -0,0 +1,157 @@ +//============================================================== +// Copyright 2022 Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= + +#include +#include +#include +#include + +#include "Utils.hpp" + +using namespace sycl; + +void iso3dfd(queue& q, float* next, float* prev, float* vel, float* coeff, + const size_t n1, const size_t n2, const size_t n3, + const size_t nreps) { + // Create 3D SYCL range for kernels which not include HALO + range<3> kernel_range(n1 - 2 * kHalfLength, n2 - 2 * kHalfLength, + n3 - 2 * kHalfLength); + // Create 1D SYCL range for buffers which include HALO + range<1> buffer_range(n1 * n2 * n3); + // Create buffers using SYCL class buffer + buffer next_buf(next, buffer_range); + buffer prev_buf(prev, buffer_range); + buffer vel_buf(vel, buffer_range); + buffer coeff_buf(coeff, range(kHalfLength + 1)); + + for (auto it = 0; it < nreps; it++) { + // Submit command group for execution + q.submit([&](handler& h) { + // Create accessors + accessor next_acc(next_buf, h); + accessor prev_acc(prev_buf, h); + accessor vel_acc(vel_buf, h, read_only); + accessor coeff_acc(coeff_buf, h, read_only); + + // Send a SYCL kernel(lambda) to the device for parallel execution + // Each kernel runs single cell + h.parallel_for(kernel_range, [=](id<3> nidx) { + // Start of device code + // Add offsets to indices to exclude HALO + int n2n3 = n2 * n3; + 
int i = nidx[0] + kHalfLength; + int j = nidx[1] + kHalfLength; + int k = nidx[2] + kHalfLength; + + // Calculate linear index for each cell + int idx = i * n2n3 + j * n3 + k; + + // Calculate values for each cell + float value = prev_acc[idx] * coeff_acc[0]; +#pragma unroll(8) + for (int x = 1; x <= kHalfLength; x++) { + value += + coeff_acc[x] * (prev_acc[idx + x] + prev_acc[idx - x] + + prev_acc[idx + x * n3] + prev_acc[idx - x * n3] + + prev_acc[idx + x * n2n3] + prev_acc[idx - x * n2n3]); + } + next_acc[idx] = 2.0f * prev_acc[idx] - next_acc[idx] + + value * vel_acc[idx]; + // End of device code + }); + }); + + // Swap the buffers for always having current values in prev buffer + std::swap(next_buf, prev_buf); + } +} + +int main(int argc, char* argv[]) { + // Arrays used to update the wavefield + float* prev; + float* next; + // Array to store wave velocity + float* vel; + + // Variables to store size of grids and number of simulation iterations + size_t n1, n2, n3; + size_t num_iterations; + + // Flag to verify results with CPU version + bool verify = false; + + if (argc < 5) { + Usage(argv[0]); + return 1; + } + + try { + // Parse command line arguments and increase them by HALO + n1 = std::stoi(argv[1]) + (2 * kHalfLength); + n2 = std::stoi(argv[2]) + (2 * kHalfLength); + n3 = std::stoi(argv[3]) + (2 * kHalfLength); + num_iterations = std::stoi(argv[4]); + if (argc > 5) verify = true; + } catch (...) 
{ + Usage(argv[0]); + return 1; + } + + // Validate input sizes for the grid + if (ValidateInput(n1, n2, n3, num_iterations)) { + Usage(argv[0]); + return 1; + } + + // Create queue and print target info with default selector and in order + // property + queue q(default_selector_v, {property::queue::in_order()}); + std::cout << " Running linear indexed GPU version\n"; + printTargetInfo(q); + + // Compute the total size of grid + size_t nsize = n1 * n2 * n3; + + prev = new float[nsize]; + next = new float[nsize]; + vel = new float[nsize]; + + // Compute coefficients to be used in wavefield update + float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1, + +7.572087e-2, -1.76767677e-2, +3.480962e-3, + -5.180005e-4, +5.074287e-5, -2.42812e-6}; + + // Apply the DX, DY and DZ to coefficients + coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz); + for (auto i = 1; i <= kHalfLength; i++) { + coeff[i] = coeff[i] / (dxyz * dxyz); + } + + // Initialize arrays and introduce initial conditions (source) + initialize(prev, next, vel, n1, n2, n3); + + auto start = std::chrono::steady_clock::now(); + + // Invoke the driver function to perform 3D wave propagation offloaded to + // the device + iso3dfd(q, next, prev, vel, coeff, n1, n2, n3, num_iterations); + + auto end = std::chrono::steady_clock::now(); + auto time = std::chrono::duration_cast(end - start) + .count(); + printStats(time, n1, n2, n3, num_iterations); + + // Verify result with the CPU serial version + if (verify) { + VerifyResult(prev, next, vel, coeff, n1, n2, n3, num_iterations); + } + + delete[] prev; + delete[] next; + delete[] vel; + + return 0; +} \ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/src/4_GPU_optimized.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/src/4_GPU_optimized.cpp new file mode 100644 index 0000000000..99dd9d85b8 --- /dev/null +++ 
b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/src/4_GPU_optimized.cpp @@ -0,0 +1,171 @@ +//============================================================== +// Copyright © Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= +#include +#include +#include +#include + +#include "Utils.hpp" + +using namespace sycl; + +void iso3dfd(queue& q, float* ptr_next, float* ptr_prev, float* ptr_vel, float* ptr_coeff, + const size_t n1, const size_t n2, const size_t n3,size_t n1_block, size_t n2_block, size_t n3_block, + const size_t nIterations) { + auto nx = n1; + auto nxy = n1*n2; + auto grid_size = nxy*n3; + + auto b1 = kHalfLength; + auto b2 = kHalfLength; + auto b3 = kHalfLength; + + auto next = sycl::aligned_alloc_device(64, grid_size + 16, q); + next += (16 - b1); + q.memcpy(next, ptr_next, sizeof(float)*grid_size); + auto prev = sycl::aligned_alloc_device(64, grid_size + 16, q); + prev += (16 - b1); + q.memcpy(prev, ptr_prev, sizeof(float)*grid_size); + auto vel = sycl::aligned_alloc_device(64, grid_size + 16, q); + vel += (16 - b1); + q.memcpy(vel, ptr_vel, sizeof(float)*grid_size); + //auto coeff = sycl::aligned_alloc_device(64, grid_size + 16, q); + auto coeff = sycl::aligned_alloc_device(64, kHalfLength+1 , q); + q.memcpy(coeff, ptr_coeff, sizeof(float)*(kHalfLength+1)); + //coeff += (16 - b1); + //q.memcpy(coeff, coeff, sizeof(float)*grid_size); + q.wait(); + + //auto local_nd_range = range(1, n2_block, n1_block); + //auto global_nd_range = range((n3 - 2 * kHalfLength)/n3_block, (n2 - 2 * kHalfLength)/n2_block, + //(n1 - 2 * kHalfLength)); + + auto local_nd_range = range<3>(n3_block,n2_block,n1_block); + auto global_nd_range = range<3>((n3-2*b3+n3_block-1)/n3_block*n3_block,(n2-2*b2+n2_block-1)/n2_block*n2_block,n1_block); + + + for (auto i = 0; i < nIterations; i += 1) { + q.submit([&](auto &h) { + h.parallel_for( + nd_range(global_nd_range, local_nd_range), [=](auto item) 
+ //[[intel::reqd_sub_group_size(32)]] + //[[intel::kernel_args_restrict]] + { + const int iz = b3 + item.get_global_id(0); + const int iy = b2 + item.get_global_id(1); + if (iz < n3 - b3 && iy < n2 - b2) + for (int ix = b1+item.get_global_id(2); ix < n1 - b1; ix += n1_block) + { + auto gid = ix + iy*nx + iz*nxy; + float *pgid = prev+gid; + auto value = coeff[0] * pgid[0]; +#pragma unroll(kHalfLength) + for (auto iter = 1; iter <= kHalfLength; iter++) + value += coeff[iter]*(pgid[iter*nxy] + pgid[-iter*nxy] + pgid[iter*nx] + pgid[-iter*nx] + pgid[iter] + pgid[-iter]); + next[gid] = 2.0f*pgid[0] - next[gid] + value*vel[gid]; + } + }); + }).wait(); + std::swap(next, prev); + } + q.memcpy(ptr_prev, prev, sizeof(float)*grid_size); + + sycl::free(next - (16 - b1),q); + sycl::free(prev - (16 - b1),q); + sycl::free(vel - (16 - b1),q); + sycl::free(coeff,q); + +} + +int main(int argc, char* argv[]) { + // Arrays used to update the wavefield + float* prev; + float* next; + // Array to store wave velocity + float* vel; + + // Variables to store size of grids and number of simulation iterations + size_t n1, n2, n3; + size_t n1_block, n2_block, n3_block; + size_t num_iterations; + + // Flag to verify results with CPU version + bool verify = false; + + if (argc < 5) { + Usage(argv[0]); + return 1; + } + + try { + // Parse command line arguments and increase them by HALO + n1 = std::stoi(argv[1]) + (2 * kHalfLength); + n2 = std::stoi(argv[2]) + (2 * kHalfLength); + n3 = std::stoi(argv[3]) + (2 * kHalfLength); + n1_block = std::stoi(argv[4]); + n2_block = std::stoi(argv[5]); + n3_block = std::stoi(argv[6]); + num_iterations = std::stoi(argv[7]); + } catch (...) 
{ + Usage(argv[0]); + return 1; + } + + // Validate input sizes for the grid + if (ValidateInput(n1, n2, n3, num_iterations)) { + Usage(argv[0]); + return 1; + } + + // Create queue and print target info with default selector and in order + // property + queue q(default_selector_v, {property::queue::in_order()}); + std::cout << " Running linear indexed GPU version\n"; + printTargetInfo(q); + + // Compute the total size of grid + size_t nsize = n1 * n2 * n3; + + prev = new float[nsize]; + next = new float[nsize]; + vel = new float[nsize]; + + // Compute coefficients to be used in wavefield update + float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1, + +7.572087e-2, -1.76767677e-2, +3.480962e-3, + -5.180005e-4, +5.074287e-5, -2.42812e-6}; + + // Apply the DX, DY and DZ to coefficients + coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz); + for (auto i = 1; i <= kHalfLength; i++) { + coeff[i] = coeff[i] / (dxyz * dxyz); + } + + // Initialize arrays and introduce initial conditions (source) + initialize(prev, next, vel, n1, n2, n3); + + auto start = std::chrono::steady_clock::now(); + + // Invoke the driver function to perform 3D wave propagation offloaded to + // the device + iso3dfd(q, next, prev, vel, coeff, n1, n2, n3,n1_block,n2_block,n3_block, num_iterations); + + auto end = std::chrono::steady_clock::now(); + auto time = std::chrono::duration_cast(end - start) + .count(); + printStats(time, n1, n2, n3, num_iterations); + + // Verify result with the CPU serial version + if (verify) { + VerifyResult(prev, next, vel, coeff, n1, n2, n3, num_iterations); + } + + delete[] prev; + delete[] next; + delete[] vel; + + return 0; +} diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/src/CMakeLists.txt b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/src/CMakeLists.txt new file mode 100644 index 0000000000..93f5af83b7 --- /dev/null +++ 
b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/src/CMakeLists.txt @@ -0,0 +1,29 @@ +set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -fsycl --std=c++17") +# Set default build type to RelWithDebInfo if not specified +if (NOT CMAKE_BUILD_TYPE) + message (STATUS "Default CMAKE_BUILD_TYPE not set using Release with Debug Info") + set (CMAKE_BUILD_TYPE "RelWithDebInfo" CACHE + STRING "Choose the type of build, options are: None Debug Release RelWithDebInfo MinSizeRel" + FORCE) +endif() + +set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS}") + +add_executable(1_CPU_only 1_CPU_only.cpp) +add_executable(2_GPU_basic 2_GPU_basic.cpp) +add_executable(3_GPU_linear 3_GPU_linear.cpp) +add_executable(4_GPU_optimized 4_GPU_optimized.cpp) + +target_link_libraries(1_CPU_only OpenCL sycl) +target_link_libraries(2_GPU_basic OpenCL sycl) +target_link_libraries(3_GPU_linear OpenCL sycl) +target_link_libraries(4_GPU_optimized OpenCL sycl) + +add_custom_target(run_all 1_CPU_only 256 256 256 20 + COMMAND 2_GPU_basic 1024 1024 1024 100 + COMMAND 3_GPU_linear 1024 1024 1024 100 + COMMAND 4_GPU_optimized 1024 1024 1024 32 4 8 100) +add_custom_target(run_cpu 1_CPU_only 1024 1024 1024 100) +add_custom_target(run_gpu_basic 2_GPU_basic 1024 1024 1024 100) +add_custom_target(run_gpu_linear 3_GPU_linear 1024 1024 1024 100) +add_custom_target(run_gpu_optimized 4_GPU_optimized 1024 1024 1024 32 4 8 100) \ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/src/Iso3dfd.hpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/src/Iso3dfd.hpp new file mode 100644 index 0000000000..e3487fa0cf --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/src/Iso3dfd.hpp @@ -0,0 +1,21 @@ +//============================================================== +// Copyright © 2022 Intel Corporation +// +// SPDX-License-Identifier: MIT +// 
============================================================= + +#pragma once + +constexpr size_t kHalfLength = 8; +constexpr float dxyz = 50.0f; +constexpr float dt = 0.002f; + +#define STENCIL_LOOKUP(ir) \ + (coeff[ir] * ((ptr_prev[ix + ir] + ptr_prev[ix - ir]) + \ + (ptr_prev[ix + ir * n1] + ptr_prev[ix - ir * n1]) + \ + (ptr_prev[ix + ir * dimn1n2] + ptr_prev[ix - ir * dimn1n2]))) + + +#define KERNEL_STENCIL_LOOKUP(x) \ + coeff[x] * (tab[l_idx + x] + tab[l_idx - x] + front[x] + back[x - 1] \ + + tab[l_idx + l_n3 * x] + tab[l_idx - l_n3 * x]) \ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/src/Utils.hpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/src/Utils.hpp new file mode 100644 index 0000000000..98d4a6e12c --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/01_ISO3DFD_CPU/src/Utils.hpp @@ -0,0 +1,259 @@ +//============================================================== +// Copyright © 2022 Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= + +#pragma once + +#include +#include + +#include "Iso3dfd.hpp" + +void Usage(const std::string& programName, bool usedNd_ranges = false) { + std::cout << "--------------------------------------\n"; + std::cout << " Incorrect parameters \n"; + std::cout << " Usage: "; + std::cout << programName << " n1 n2 n3 Iterations"; + + if (usedNd_ranges) std::cout << " kernel_iterations n2_WGS n3_WGS"; + + std::cout << " [verify]\n\n"; + std::cout << " n1 n2 n3 : Grid sizes for the stencil\n"; + std::cout << " Iterations : No. of timesteps.\n"; + + if (usedNd_ranges) { + std::cout + << " kernel_iterations : No. 
of cells calculated by one kernel\n"; + std::cout << " n2_WGS n3_WGS : n2 and n3 work group sizes\n"; + } + std::cout + << " [verify] : Optional: Compare results with CPU version\n"; + std::cout << "--------------------------------------\n"; + std::cout << "--------------------------------------\n"; +} + +bool ValidateInput(size_t n1, size_t n2, size_t n3, size_t num_iterations, + size_t kernel_iterations = -1, size_t n2_WGS = kHalfLength, + size_t n3_WGS = kHalfLength) { + if ((n1 < kHalfLength) || (n2 < kHalfLength) || (n3 < kHalfLength) || + (n2_WGS < kHalfLength) || (n3_WGS < kHalfLength)) { + std::cout << "--------------------------------------\n"; + std::cout << " Invalid grid size : n1, n2, n3, n2_WGS, n3_WGS should be " + "greater than " + << kHalfLength << "\n"; + return true; + } + + if ((n2 < n2_WGS) || (n3 < n3_WGS)) { + std::cout << "--------------------------------------\n"; + std::cout << " Invalid work group size : n2 should be greater than n2_WGS " + "and n3 greater than n3_WGS\n"; + return true; + } + + if (((n2 - 2 * kHalfLength) % n2_WGS) && kernel_iterations != -1) { + std::cout << "--------------------------------------\n"; + std::cout << " ERROR: Invalid Grid Size: n2 should be multiple of n2_WGS - " + << n2_WGS << "\n"; + return true; + } + if (((n3 - 2 * kHalfLength) % n3_WGS) && kernel_iterations != -1) { + std::cout << "--------------------------------------\n"; + std::cout << " ERROR: Invalid Grid Size: n3 should be multiple of n3_WGS - " + << n3_WGS << "\n"; + return true; + } + if (((n1 - 2 * kHalfLength) % kernel_iterations) && kernel_iterations != -1) { + std::cout << "--------------------------------------\n"; + std::cout << " ERROR: Invalid Grid Size: n1 should be multiple of " + "kernel_iterations - " + << kernel_iterations << "\n"; + return true; + } + + return false; +} + +bool CheckWorkGroupSize(sycl::queue& q, unsigned int n2_WGS, + unsigned int n3_WGS) { + auto device = q.get_device(); + auto max_block_size = + 
device.get_info(); + + if ((max_block_size > 1) && (n2_WGS * n3_WGS > max_block_size)) { + std::cout << "ERROR: Invalid block sizes: n2_WGS * n3_WGS should be " + "less than or equal to " + << max_block_size << "\n"; + return true; + } + + return false; +} + +void printTargetInfo(sycl::queue& q) { + auto device = q.get_device(); + auto max_block_size = + device.get_info(); + + auto max_exec_unit_count = + device.get_info(); + + std::cout << " Running on " << device.get_info() + << "\n"; + std::cout << " The Device Max Work Group Size is : " << max_block_size + << "\n"; + std::cout << " The Device Max EUCount is : " << max_exec_unit_count << "\n"; +} + +void initialize(float* ptr_prev, float* ptr_next, float* ptr_vel, size_t n1, + size_t n2, size_t n3) { + auto dim2 = n2 * n1; + + for (auto i = 0; i < n3; i++) { + for (auto j = 0; j < n2; j++) { + auto offset = i * dim2 + j * n1; + + for (auto k = 0; k < n1; k++) { + ptr_prev[offset + k] = 0.0f; + ptr_next[offset + k] = 0.0f; + ptr_vel[offset + k] = + 2250000.0f * dt * dt; // Integration of the v*v and dt*dt here + } + } + } + // Then we add a source + float val = 1.f; + for (auto s = 5; s >= 0; s--) { + for (auto i = n3 / 2 - s; i < n3 / 2 + s; i++) { + for (auto j = n2 / 4 - s; j < n2 / 4 + s; j++) { + auto offset = i * dim2 + j * n1; + for (auto k = n1 / 4 - s; k < n1 / 4 + s; k++) { + ptr_prev[offset + k] = val; + } + } + } + val *= 10; + } +} + +void printStats(double time, size_t n1, size_t n2, size_t n3, + size_t num_iterations) { + float throughput_mpoints = 0.0f, mflops = 0.0f, normalized_time = 0.0f; + double mbytes = 0.0f; + + normalized_time = (double)time / num_iterations; + throughput_mpoints = ((n1 - 2 * kHalfLength) * (n2 - 2 * kHalfLength) * + (n3 - 2 * kHalfLength)) / + (normalized_time * 1e3f); + mflops = (7.0f * kHalfLength + 5.0f) * throughput_mpoints; + mbytes = 12.0f * throughput_mpoints; + + std::cout << "--------------------------------------\n"; + std::cout << "time : " << time / 1e3f << " 
secs\n"; + std::cout << "throughput : " << throughput_mpoints << " Mpts/s\n"; + std::cout << "flops : " << mflops / 1e3f << " GFlops\n"; + std::cout << "bytes : " << mbytes / 1e3f << " GBytes/s\n"; + std::cout << "\n--------------------------------------\n"; + std::cout << "\n--------------------------------------\n"; +} + +bool WithinEpsilon(float* output, float* reference, const size_t dim_x, + const size_t dim_y, const size_t dim_z, + const unsigned int radius, const int zadjust = 0, + const float delta = 0.01f) { + std::ofstream error_file; + error_file.open("error_diff.txt"); + + bool error = false; + double norm2 = 0; + + for (size_t iz = 0; iz < dim_z; iz++) { + for (size_t iy = 0; iy < dim_y; iy++) { + for (size_t ix = 0; ix < dim_x; ix++) { + if (ix >= radius && ix < (dim_x - radius) && iy >= radius && + iy < (dim_y - radius) && iz >= radius && + iz < (dim_z - radius + zadjust)) { + float difference = fabsf(*reference - *output); + norm2 += difference * difference; + if (difference > delta) { + error = true; + error_file << " ERROR: " << ix << ", " << iy << ", " << iz << " " + << *output << " instead of " << *reference + << " (|e|=" << difference << ")\n"; + } + } + ++output; + ++reference; + } + } + } + + error_file.close(); + norm2 = sqrt(norm2); + if (error) std::cout << "error (Euclidean norm): " << norm2 << "\n"; + return error; +} + +void inline iso3dfdCPUIteration(float* ptr_next_base, float* ptr_prev_base, + float* ptr_vel_base, float* coeff, + const size_t n1, const size_t n2, + const size_t n3) { + auto dimn1n2 = n1 * n2; + + auto n3_end = n3 - kHalfLength; + auto n2_end = n2 - kHalfLength; + auto n1_end = n1 - kHalfLength; + + for (auto iz = kHalfLength; iz < n3_end; iz++) { + for (auto iy = kHalfLength; iy < n2_end; iy++) { + float* ptr_next = ptr_next_base + iz * dimn1n2 + iy * n1; + float* ptr_prev = ptr_prev_base + iz * dimn1n2 + iy * n1; + float* ptr_vel = ptr_vel_base + iz * dimn1n2 + iy * n1; + + for (auto ix = kHalfLength; ix < n1_end; 
ix++) { + float value = ptr_prev[ix] * coeff[0]; + value += STENCIL_LOOKUP(1); + value += STENCIL_LOOKUP(2); + value += STENCIL_LOOKUP(3); + value += STENCIL_LOOKUP(4); + value += STENCIL_LOOKUP(5); + value += STENCIL_LOOKUP(6); + value += STENCIL_LOOKUP(7); + value += STENCIL_LOOKUP(8); + + ptr_next[ix] = 2.0f * ptr_prev[ix] - ptr_next[ix] + value * ptr_vel[ix]; + } + } + } +} + +void CalculateReference(float* next, float* prev, float* vel, float* coeff, + const size_t n1, const size_t n2, const size_t n3, + const size_t nreps) { + for (auto it = 0; it < nreps; it += 1) { + iso3dfdCPUIteration(next, prev, vel, coeff, n1, n2, n3); + std::swap(next, prev); + } +} + +void VerifyResult(float* prev, float* next, float* vel, float* coeff, + const size_t n1, const size_t n2, const size_t n3, + const size_t nreps) { + std::cout << "Running CPU version for result comparison: "; + auto nsize = n1 * n2 * n3; + float* temp = new float[nsize]; + memcpy(temp, prev, nsize * sizeof(float)); + initialize(prev, next, vel, n1, n2, n3); + CalculateReference(next, prev, vel, coeff, n1, n2, n3, nreps); + bool error = WithinEpsilon(temp, prev, n1, n2, n3, kHalfLength, 0, 0.1f); + if (error) { + std::cout << "Final wavefields from SYCL device and CPU are not " + << "equivalent: Fail\n"; + } else { + std::cout << "Final wavefields from SYCL device and CPU are equivalent:" + << " Success\n"; + } + delete[] temp; +} diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/CMakeLists.txt b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/CMakeLists.txt new file mode 100644 index 0000000000..e0bded3dae --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/CMakeLists.txt @@ -0,0 +1,4 @@ +cmake_minimum_required (VERSION 3.4) +set(CMAKE_CXX_COMPILER "icpx") +project (Iso3DFD) +add_subdirectory (src) \ No newline at end of file diff --git 
a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/img/gpu_basic.png b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/img/gpu_basic.png new file mode 100644 index 0000000000..1763a8f380 Binary files /dev/null and b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/img/gpu_basic.png differ diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/img/r1.png b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/img/r1.png new file mode 100644 index 0000000000..2e4b88e186 Binary files /dev/null and b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/img/r1.png differ diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/iso3dfd_gpu_basic.ipynb b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/iso3dfd_gpu_basic.ipynb new file mode 100644 index 0000000000..60bb210777 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/iso3dfd_gpu_basic.ipynb @@ -0,0 +1,548 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# ISO3DFD and Implementation using SYCL offloading to a GPU" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Learning Objectives" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "- Understand how to offload the most profitable loops in your code to the GPU using SYCL\n", + "- Map arrays on the device and define how you are going to access your data\n", + "- Offload the loops to dispatch the work on the selected device\n", + "
" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## ISO3DFD offloading to GPU" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the previous activity, we used Intel® Offload Advisor to decide which sections of code would be good candidates for offloading to the gen9 GPU. Advisor ended up recommending that we focus on one of the most profitable loops in our serial version of the CPU code.\n", + "\n", + "Our goal now is to make sure that this loop is correctly offloaded to a GPU.\n", + "\n", + "Based on the output provided by the Advisor, we can see the estimated speed-up if we offload the loops identified in the Top Offloaded section of the output. Using SYCL, we'll offload that function to run as a kernel on the system's GPU." ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Offloading the ISO3DFD application to a GPU\n", + "The 2_GPU_basic_offload version of the sample implements the basic offload of the iso3dfd function to an available GPU on the system.\n", + "* We have to create a queue in the iso3dfd function, as shown below.\n", + "\n", + "```\n", + "queue q(default_selector_v, {property::queue::in_order()});\n", + "```\n", + "* Instead of iterating over all the cells in memory, we will create buffers and accessors to move the data to the GPU when needed.\n", + "\n", + "```\n", + "// Create 3D SYCL range for kernels which not include HALO\n", + " range<3> kernel_range(n1 - 2 * kHalfLength, n2 - 2 * kHalfLength,\n", + " n3 - 2 * kHalfLength);\n", + " // Create 3D SYCL range for buffers which include HALO\n", + " range<3> buffer_range(n1, n2, n3);\n", + " // Create buffers using SYCL class buffer\n", + " buffer next_buf(next, buffer_range);\n", + " buffer prev_buf(prev, buffer_range);\n", + " buffer vel_buf(vel, buffer_range);\n", + " buffer coeff_buf(coeff, range(kHalfLength + 1));\n", + " ```\n", + "* Create a kernel which will do the calculations; each kernel will 
calculate one cell.\n", + "\n", + "```\n", + "// Send a SYCL kernel(lambda) to the device for parallel execution\n", + " // Each kernel runs single cell\n", + " h.parallel_for(kernel_range, [=](id<3> idx) {\n", + " // Start of device code\n", + " // Add offsets to indices to exclude HALO\n", + " int i = idx[0] + kHalfLength;\n", + " int j = idx[1] + kHalfLength;\n", + " int k = idx[2] + kHalfLength;\n", + "\n", + " // Calculate values for each cell\n", + " // Please refer to the source code below\n", + " });\n", + " });\n", + " \n", + "```\n", + "The code below shows the ISO3DFD GPU version using SYCL. Inspect the code; there are no modifications necessary:\n", + "1. Inspect the code cell below and click run ▶ to save the code to file\n", + "2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code." ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%writefile src/2_GPU_basic.cpp\n", + "//==============================================================\n", + "// Copyright © Intel Corporation\n", + "//\n", + "// SPDX-License-Identifier: MIT\n", + "// =============================================================\n", + "\n", + "#include <sycl/sycl.hpp>\n", + "#include <chrono>\n", + "#include <string>\n", + "#include <fstream>\n", + "\n", + "#include \"Utils.hpp\"\n", + "\n", + "using namespace sycl;\n", + "\n", + "void iso3dfd(queue& q, float* next, float* prev, float* vel, float* coeff,\n", + " const size_t n1, const size_t n2, const size_t n3,\n", + " const size_t nreps) {\n", + " // Create 3D SYCL range for kernels which not include HALO\n", + " range<3> kernel_range(n1 - 2 * kHalfLength, n2 - 2 * kHalfLength,\n", + " n3 - 2 * kHalfLength);\n", + " // Create 3D SYCL range for buffers which include HALO\n", + " range<3> buffer_range(n1, n2, n3);\n", + " // Create buffers using SYCL class buffer\n", + " buffer next_buf(next, buffer_range);\n", + " buffer prev_buf(prev, buffer_range);\n", + " buffer 
vel_buf(vel, buffer_range);\n", + " buffer coeff_buf(coeff, range(kHalfLength + 1));\n", + "\n", + " for (auto it = 0; it < nreps; it += 1) {\n", + " // Submit command group for execution\n", + " q.submit([&](handler& h) {\n", + " // Create accessors\n", + " accessor next_acc(next_buf, h);\n", + " accessor prev_acc(prev_buf, h);\n", + " accessor vel_acc(vel_buf, h, read_only);\n", + " accessor coeff_acc(coeff_buf, h, read_only);\n", + "\n", + " // Send a SYCL kernel(lambda) to the device for parallel execution\n", + " // Each kernel runs single cell\n", + " h.parallel_for(kernel_range, [=](id<3> idx) {\n", + " // Start of device code\n", + " // Add offsets to indices to exclude HALO\n", + " int i = idx[0] + kHalfLength;\n", + " int j = idx[1] + kHalfLength;\n", + " int k = idx[2] + kHalfLength;\n", + "\n", + " // Calculate values for each cell\n", + " float value = prev_acc[i][j][k] * coeff_acc[0];\n", + "#pragma unroll(8)\n", + " for (int x = 1; x <= kHalfLength; x++) {\n", + " value +=\n", + " coeff_acc[x] * (prev_acc[i][j][k + x] + prev_acc[i][j][k - x] +\n", + " prev_acc[i][j + x][k] + prev_acc[i][j - x][k] +\n", + " prev_acc[i + x][j][k] + prev_acc[i - x][j][k]);\n", + " }\n", + " next_acc[i][j][k] = 2.0f * prev_acc[i][j][k] - next_acc[i][j][k] +\n", + " value * vel_acc[i][j][k];\n", + " // End of device code\n", + " });\n", + " });\n", + "\n", + " // Swap the buffers for always having current values in prev buffer\n", + " std::swap(next_buf, prev_buf);\n", + " }\n", + "}\n", + "\n", + "int main(int argc, char* argv[]) {\n", + " // Arrays used to update the wavefield\n", + " float* prev;\n", + " float* next;\n", + " // Array to store wave velocity\n", + " float* vel;\n", + "\n", + " // Variables to store size of grids and number of simulation iterations\n", + " size_t n1, n2, n3;\n", + " size_t num_iterations;\n", + "\n", + " // Flag to verify results with CPU version\n", + " bool verify = false;\n", + "\n", + " if (argc < 5) {\n", + " Usage(argv[0]);\n", + " 
return 1;\n", + " }\n", + "\n", + " try {\n", + " // Parse command line arguments and increase them by HALO\n", + " n1 = std::stoi(argv[1]) + (2 * kHalfLength);\n", + " n2 = std::stoi(argv[2]) + (2 * kHalfLength);\n", + " n3 = std::stoi(argv[3]) + (2 * kHalfLength);\n", + " num_iterations = std::stoi(argv[4]);\n", + " if (argc > 5) verify = true;\n", + " } catch (...) {\n", + " Usage(argv[0]);\n", + " return 1;\n", + " }\n", + "\n", + " // Validate input sizes for the grid\n", + " if (ValidateInput(n1, n2, n3, num_iterations)) {\n", + " Usage(argv[0]);\n", + " return 1;\n", + " }\n", + "\n", + " // Create queue and print target info with default selector and in order\n", + " // property\n", + " queue q(default_selector_v, {property::queue::in_order()});\n", + " std::cout << \" Running GPU basic offload version\\n\";\n", + " printTargetInfo(q);\n", + "\n", + " // Compute the total size of grid\n", + " size_t nsize = n1 * n2 * n3;\n", + "\n", + " prev = new float[nsize];\n", + " next = new float[nsize];\n", + " vel = new float[nsize];\n", + "\n", + " // Compute coefficients to be used in wavefield update\n", + " float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1,\n", + " +7.572087e-2, -1.76767677e-2, +3.480962e-3,\n", + " -5.180005e-4, +5.074287e-5, -2.42812e-6};\n", + "\n", + " // Apply the DX, DY and DZ to coefficients\n", + " coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz);\n", + " for (auto i = 1; i <= kHalfLength; i++) {\n", + " coeff[i] = coeff[i] / (dxyz * dxyz);\n", + " }\n", + "\n", + " // Initialize arrays and introduce initial conditions (source)\n", + " initialize(prev, next, vel, n1, n2, n3);\n", + "\n", + " auto start = std::chrono::steady_clock::now();\n", + "\n", + " // Invoke the driver function to perform 3D wave propagation offloaded to\n", + " // the device\n", + " iso3dfd(q, next, prev, vel, coeff, n1, n2, n3, num_iterations);\n", + "\n", + " auto end = std::chrono::steady_clock::now();\n", + " auto time = 
std::chrono::duration_cast<std::chrono::milliseconds>(end - start)\n", + " .count();\n", + " printStats(time, n1, n2, n3, num_iterations);\n", + "\n", + " // Verify result with the CPU serial version\n", + " if (verify) {\n", + " VerifyResult(prev, next, vel, coeff, n1, n2, n3, num_iterations);\n", + " }\n", + "\n", + " delete[] prev;\n", + " delete[] next;\n", + " delete[] vel;\n", + "\n", + " return 0;\n", + "}" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once the application is created, we can run it from the command line using a few parameters, as follows:\n", + "src/2_GPU_basic 256 256 256 100\n", + "
\n", + "- src/2_GPU_basic is the binary\n", + "- 256 256 256 are the sizes of the 3 dimensions; increasing them will result in more computation time\n", + "- 100 is the number of time steps\n", + "
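As an aside, the relationship between the requested grid size, the HALO padding added in main(), and the flat indexing used by both the CPU and GPU versions can be sketched in plain C++. This is an illustrative sketch: only `kHalfLength = 8` (the 16th-order stencil reads 8 neighbor cells per side) comes from the sample; the helper names are made up.

```cpp
#include <cstddef>

// HALO width from the sample: the stencil reads 8 cells on each side.
constexpr std::size_t kHalfLength = 8;

// Mirrors "n1 = std::stoi(argv[1]) + (2 * kHalfLength)" in main():
// the allocated grid is padded with HALO cells on both ends of each axis.
constexpr std::size_t PaddedSize(std::size_t n_user) {
  return n_user + 2 * kHalfLength;
}

// Flat offset of cell (iz, iy, ix), matching the "iz * dimn1n2 + iy * n1 + ix"
// arithmetic in the CPU reference loops: X is the fastest-varying dimension.
constexpr std::size_t FlatIndex(std::size_t iz, std::size_t iy, std::size_t ix,
                                std::size_t n1, std::size_t n2) {
  return iz * (n1 * n2) + iy * n1 + ix;
}
```

So `src/2_GPU_basic 256 256 256 100` actually allocates a 272×272×272 grid, and X-neighbors sit 1 apart in memory, Y-neighbors `n1` apart, and Z-neighbors `n1 * n2` apart.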
" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Build and Run\n", + "Select the cell below and click run ▶ to compile and execute the code:" ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "! chmod 755 q; chmod 755 run_gpu_only.sh;if [ -x \"$(command -v qsub)\" ]; then ./q run_gpu_only.sh; else ./run_gpu_only.sh; fi" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Iso3DFD GPU Optimizations" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We started from a code version running with standard C++ on the CPU. Using Intel® Offload Advisor, we determined which loop was a good candidate for offload, and then, using SYCL, we worked on a solution to make our code run on the GPU as well as on the CPU.\n", + "\n", + "Getting the best performance possible on the CPU or on the GPU would require some fine tuning specific to each platform, but we already have a portable solution.\n", + "\n", + "The next step to optimize further on the GPU is to run the Roofline Model and/or VTune to understand whether we have obvious bottlenecks." ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## What is the Roofline Model?" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A Roofline chart is a visual representation of application performance in relation to hardware limitations, including memory bandwidth and computational peaks. 
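Before diving in, the roofline bound itself is easy to state in code. The sketch below is illustrative only; the peak GFLOP/s and GB/s numbers used with it are made-up placeholders, not measurements of any device in this sample.

```cpp
#include <algorithm>

// Roofline model: attainable performance is capped by either the compute
// peak or by how fast the memory system can feed the ALUs:
//   attainable GFLOP/s = min(peak GFLOP/s, AI * peak GB/s),
// where AI (arithmetic intensity) = FLOPs performed / bytes moved.
double AttainableGflops(double flops, double bytes_moved, double peak_gflops,
                        double peak_gb_per_s) {
  const double arithmetic_intensity = flops / bytes_moved;  // FLOP per byte
  return std::min(peak_gflops, arithmetic_intensity * peak_gb_per_s);
}
```

A kernel whose dot lands on the sloped part of the chart is memory bound (raising its arithmetic intensity helps); one sitting under the flat ceiling is compute bound (only reducing work, such as index arithmetic, helps).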
Intel Advisor includes an automated Roofline tool that measures and plots the chart on its own, so all you need to do is read it.\n", + "\n", + "The chart can be used to identify not only where bottlenecks exist, but what’s likely causing them, and which ones will provide the most speedup if optimized.\n", + "\n", + "#### Requirements for a Roofline Model on a GPU\n", + "To generate a roofline analysis report, the application must be at least partially running on a GPU, the offload must be implemented with OpenMP, SYCL, or OpenCL, and a recent version of Intel® Advisor is required.\n", + "\n", + "Generating a Roofline Model on a GPU produces a multi-level roofline, where a single loop generates several dots and each dot can be compared to its own memory level (GTI/L3/DRAM/SLM).\n" ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "#### Finding Effective Optimization Strategies\n", + "The GPU Roofline Performance Insights highlight poorly performing loops and show the performance ‘headroom’ for each loop: which ones can be improved and which are worth improving. The report shows the likely causes of bottlenecks, such as being memory bound vs. compute bound. 
It also suggests next optimization steps.\n" ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "### Running the GPU Roofline Analysis\n", + "With the offload implemented in 2_GPU_basic using SYCL, we'll want to run a roofline analysis to look for areas where there is room for performance optimization.\n", + "```\n", + "advisor --collect=roofline --profile-gpu --project-dir=./advi_results -- ./myApplication \n", + "```\n", + "The roofline collection for the iso3DFD GPU code can be run using\n", + "```\n", + "advisor --collect=roofline --profile-gpu --project-dir=./../advisor/2_gpu -- ./build/src/2_GPU_basic 256 256 256 100\n", + "```" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Build and Run\n", + "Select the cell below and click run ▶ to compile and execute the code:" ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "! chmod 755 q; chmod 755 run_gpu_roofline_advisor.sh;if [ -x \"$(command -v qsub)\" ]; then ./q run_gpu_roofline_advisor.sh; else ./run_gpu_roofline_advisor.sh; fi" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Analyzing the HTML report" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From the roofline analysis of the 2_GPU_basic_offload.cpp version, we can see that the performance is close to predicted. 
\n", + "As noted in the below roofline model we can observe that,\n", + "\n", + "* The application is bounded by compute, specifically that the kernels have high arithmetic intensity.\n", + "* GINTOPS is more than 15X of the GFLOPS\n", + "* High XVE Threading Occupancy\n", + "* We are clearly bounded by the INT operations which is all about index computations\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "### Roofline Analysis report\n", + "To display the report, just execute the following frame. In practice, the report will be available in the folder you defined as --out-dir in the previous script. \n", + "\n", + "[View the report in HTML](reports/advisor-report.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import IFrame\n", + "display(IFrame(src='reports/advisor-report.html', width=1024, height=768))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Generating VTune reports\n", + "Below exercises we use VTune™ analyzer as a way to see what is going on with each implementation. The information was the high-level hotspot generated from the collection and rendered in an HTML iframe. Depending on the options chosen, many of the VTune analyzer's performance collections can be rendered via HTML pages. 
The VTune scripts below collect GPU offload and GPU hotspots information.\n", + "\n", + "#### Learn more about VTune\n", + "\n", + "There is extensive training on VTune; click [here](https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/vtune-profiler.html#gs.2xmez3) for deep-dive training.\n", + "\n", + "```\n", + "vtune -run-pass-thru=--no-altstack -collect=gpu-offload -result-dir=vtune_dir -- ./build/src/2_GPU_basic 1024 1024 1024 100\n", + "```\n", + "\n", + "```\n", + "vtune -run-pass-thru=--no-altstack -collect=gpu-hotspots -result-dir=vtune_dir_hotspots -- ./build/src/2_GPU_basic 1024 1024 1024 100\n", + "```\n", + "\n", + "```\n", + "vtune -report summary -result-dir vtune_dir -format html -report-output ./reports/output_offload.html\n", + "```\n", + "\n", + "```\n", + "vtune -report summary -result-dir vtune_dir_hotspots -format html -report-output ./reports/output_hotspots.html\n", + "```\n", + "[View the VTune offload report in HTML](reports/output_offload.html)" ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import IFrame\n", + "display(IFrame(src='reports/output_offload.html', width=1024, height=768))\n" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[View the VTune hotspots report in HTML](reports/output_hotspots.html)" ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import IFrame\n", + "display(IFrame(src='reports/output_hotspots.html', width=1024, height=768))\n" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Build and Run\n", + "Select the cell below and click run ▶ to compile and execute the code:" ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "! 
chmod 755 q; chmod 755 run_gpu_vtune.sh;if [ -x \"$(command -v qsub)\" ]; then ./q run_gpu_vtune.sh; else ./run_gpu_vtune.sh; fi" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary\n", + "### Next Iteration of Implementing GPU Optimizations\n", + "\n", + "We ran the roofline model and observed:\n", + "* The application is now bound by compute: the kernels have high arithmetic intensity, and we are bound by the INT operations, which are all about index computations.\n", + "* What we need to solve is how to provide the kernel with the right index (the offset in the original code).\n", + "* SYCL provides this information through an iterator that is passed by the runtime to the function. This iterator identifies the position of the current iteration in the 3D space.\n", + "* It can be accessed in 3 dimensions by calling it.get_global_id(0), it.get_global_id(1), and it.get_global_id(2).\n", + "* In this next iteration, we'll address the kernels being compute bound by reducing the number of index calculations.\n" ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "state": {}, + "version_major": 2, + "version_minor": 0 + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/q b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/q new file mode 100644 index 
0000000000..9bbad910d7 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/q @@ -0,0 +1,49 @@ +#!/bin/bash +#========================================== +# Copyright © Intel Corporation +# +# SPDX-License-Identifier: MIT +#========================================== +# Script to submit job in Intel(R) DevCloud +# Version: 0.71 +#========================================== +if [ -z "$1" ]; then + echo "Missing script argument, Usage: ./q run.sh" +elif [ ! -f "$1" ]; then + echo "File $1 does not exist" +else + echo "Job has been submitted to Intel(R) DevCloud and will execute soon." + echo "" + script=$1 + # Remove old output files + rm *.sh.* > /dev/null 2>&1 + # Submit job using qsub + qsub_id=`qsub -l nodes=1:gpu:ppn=2 -d . $script` + job_id="$(cut -d'.' -f1 <<<"$qsub_id")" + # Print qstat output + qstat + # Wait for output file to be generated and display + echo "" + echo -ne "Waiting for Output " + until [ -f $script.o$job_id ]; do + sleep 1 + echo -ne "█" + ((timeout++)) + # Timeout if no output file generated within 60 seconds + if [ $timeout == 70 ]; then + echo "" + echo "" + echo "TimeOut 60 seconds: Job is still queued for execution, check for output file later ($script.o$job_id)" + echo "" + break + fi + done + # Print output and error file content if exist + if [ -n "$(find -name '*.sh.o'$job_id)" ]; then + echo " Done⬇" + cat $script.o$job_id + cat $script.e$job_id + echo "Job Completed in $timeout seconds." + rm *.sh.*$job_id > /dev/null 2>&1 + fi +fi diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/reports/advisor-report.html b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/reports/advisor-report.html new file mode 100644 index 0000000000..a21f3e0645 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/reports/advisor-report.html @@ -0,0 +1,2 @@ +Intel Advisor Report
\ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/reports/output_hotspots.html b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/reports/output_hotspots.html new file mode 100644 index 0000000000..10391f82ab --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/reports/output_hotspots.html @@ -0,0 +1,144 @@ + +
Intel® VTune Profiler 2024.0.0
  • Elapsed Time: + 36.499s
    • GPU Time: + 27.634s
  • Display controller: Intel Corporation Device 0x0bda Device Group: +
    • XVE Array Stalled/Idle: + 88.1% of Elapsed time with GPU busy
      The percentage of time when the XVEs were stalled or idle is high, which has a negative impact on compute-bound applications.
      • This section shows the XVE metrics per stack and per adapter for all the devices in this group.: +
        GPU StackGPU AdapterXVE Array Active(%)XVE Array Stalled(%)XVE Array Idle(%)
        0GPU 136.1%39.5%24.4%
        0GPU 30.0%0.0%100.0%
        0GPU 00.0%0.0%100.0%
        0GPU 20.0%0.0%100.0%
    • GPU L3 Bandwidth Bound: + 1.3% of peak value
    • Occupancy: + 22.1% of peak value
      Several factors including shared local memory, use of memory barriers, and inefficient work scheduling can cause a low value of the occupancy metric.
      • This section shows the computing tasks with low occupancy metric for all the devices in this group.: +
        Computing TaskTotal TimeOccupancy(%)SIMD Utilization(%)
        iso3dfd(sycl::_V1::queue&, float*, float*, float*, float*, unsigned long, unsigned long, unsigned long, unsigned long)::{lambda(sycl::_V1::handler&)#1}::operator()(sycl::_V1::handler&) const::{lambda(sycl::_V1::id<(int)3>)#1}14.744s23.7% of peak value0.0%
  • Collection and Platform Info: +
    • Application Command Line: + ./build/src/2_GPU_basic "1024" "1024" "1024" "100"
    • Operating System: + 5.15.0-100-generic DISTRIB_ID=Ubuntu +DISTRIB_RELEASE=22.04 +DISTRIB_CODENAME=jammy +DISTRIB_DESCRIPTION="Ubuntu 22.04.4 LTS"
    • Computer Name: + idc-beta-batch-pvc-node-06
    • Result Size: + 297.4 MB
    • Collection start time: + 23:44:22 14/03/2024 UTC
    • Collection stop time: + 23:44:58 14/03/2024 UTC
    • Collector Type: + Event-based sampling driver,User-mode sampling and tracing
    • CPU: +
      • Name: + Intel(R) Xeon(R) Processor code named Sapphirerapids
      • Frequency: + 2.000 GHz
      • Logical CPU Count: + 224
      • LLC size: + 110.1 MB
    • GPU: +
      • GPU 0: 0:41:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:41:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
      • GPU 1: 0:58:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:58:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
      • GPU 2: 0:154:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:154:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
      • GPU 3: 0:202:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:202:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
+ diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/reports/output_offload.html b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/reports/output_offload.html new file mode 100644 index 0000000000..920ddfbb21 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/reports/output_offload.html @@ -0,0 +1,142 @@ + +
Intel® VTune Profiler 2024.0.0

Recommendations:

GPU Time, % of Elapsed time: 75.2%
GPU utilization is low. Switch to the for in-depth analysis of host activity. Poor GPU utilization can prevent the application from offloading effectively.
XVE Array Stalled/Idle: 52.4% of Elapsed time with GPU busy
GPU metrics detect some kernel issues. Use GPU Compute/Media Hotspots (preview) to understand how well your application runs on the specified hardware.
  • Elapsed Time: + 36.890s
    • GPU Time, % of Elapsed time: + 75.2%
      GPU utilization is low. Consider offloading more work to the GPU to increase overall application performance.
      • GPU Time, % of Elapsed time: +
        GPU AdapterGPU EngineGPU TimeGPU Time, % of Elapsed time(%)
        GPU 1Render and GPGPU27.734s75.2%
      • Top Hotspots when GPU was idle: +
        FunctionModuleCPU Time
        asm_exc_page_faultvmlinux2.920s
        [Skipped stack frame(s)][Unknown]1.644s
        operator newlibc++abi.so1.484s
        func@0x13f9b0libze_intel_gpu.so.1.3.27191.421.434s
        memcmplibc-dynamic.so1.420s
        [Others]N/A15.764s
  • Hottest Host Tasks: +
    Host TaskTask Time% of Elapsed Time(%)Task Count
    zeEventHostSynchronize21.643s58.7%14
    zeCommandListAppendMemoryCopy6.066s16.4%1
    zeModuleCreate0.259s0.7%1
    zeCommandListAppendMemoryCopyRegion0.071s0.2%5
    zeCommandListCreateImmediate0.001s0.0%3
    [Others]0.001s0.0%105
  • Hottest GPU Computing Tasks: +
    Computing TaskTotal TimeExecution Time% of Total Time(%)SIMD Width
    iso3dfd(sycl::_V1::queue&, float*, float*, float*, float*, unsigned long, unsigned long, unsigned long, unsigned long)::{lambda(sycl::_V1::handler&)#1}::operator()(sycl::_V1::handler&) const::{lambda(sycl::_V1::id<(int)3>)#1}27.791s14.766s53.1%32
  • Collection and Platform Info: +
    • Application Command Line: + ./build/src/2_GPU_basic "1024" "1024" "1024" "100"
    • Operating System: + 5.15.0-100-generic DISTRIB_ID=Ubuntu +DISTRIB_RELEASE=22.04 +DISTRIB_CODENAME=jammy +DISTRIB_DESCRIPTION="Ubuntu 22.04.4 LTS"
    • Computer Name: + idc-beta-batch-pvc-node-06
    • Result Size: + 320.1 MB
    • Collection start time: + 23:42:59 14/03/2024 UTC
    • Collection stop time: + 23:43:35 14/03/2024 UTC
    • Collector Type: + Event-based sampling driver,Driverless Perf system-wide sampling,User-mode sampling and tracing
    • CPU: +
      • Name: + Intel(R) Xeon(R) Processor code named Sapphirerapids
      • Frequency: + 2.000 GHz
      • Logical CPU Count: + 224
      • LLC size: + 110.1 MB
    • GPU: +
      • GPU 0: 0:41:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:41:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
      • GPU 1: 0:58:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:58:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
      • GPU 2: 0:154:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:154:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
      • GPU 3: 0:202:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:202:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
+ diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/run_gpu_only.sh b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/run_gpu_only.sh new file mode 100644 index 0000000000..ae46cb4218 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/run_gpu_only.sh @@ -0,0 +1,8 @@ +#!/bin/bash + +rm -rf build +build="$PWD/build" +[ ! -d "$build" ] && mkdir -p "$build" +cd build && +cmake .. && +make run_gpu_basic diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/run_gpu_roofline.sh b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/run_gpu_roofline.sh new file mode 100644 index 0000000000..afc59abe29 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/run_gpu_roofline.sh @@ -0,0 +1,12 @@ +#!/bin/bash +#advisor --collect=survey --profile-gpu --project-dir=./advi_results -- ./build/src/2_GPU_basic 256 256 256 100 +#advisor --collect=tripcounts --flop --profile-gpu --project-dir=./advi_results -- ./build/src/2_GPU_basic 256 256 256 100 +#advisor --collect=projection --profile-gpu --model-baseline-gpu --project-dir=./advi_results + +advisor --collect=survey --profile-gpu --project-dir=./roofline -- ./build/src/2_GPU_basic 256 256 256 100 +advisor --collect=tripcounts --profile-gpu --project-dir=./roofline -- ./build/src/2_GPU_basic 256 256 256 100 +advisor --collect=projection --profile-gpu --model-baseline-gpu --project-dir=./roofline +advisor --report=roofline --gpu --project-dir=roofline --report-output=./roofline/roofline.html + + + diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/run_gpu_roofline_advisor.sh b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/run_gpu_roofline_advisor.sh new file mode 100644 index 0000000000..3badb9c173 --- /dev/null +++ 
b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/run_gpu_roofline_advisor.sh @@ -0,0 +1,4 @@ +#!/bin/bash +advisor --collect=roofline --profile-gpu --project-dir=./../advisor/2_gpu -- ./build/src/2_GPU_basic 1024 1024 1024 100 + + diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/run_gpu_vtune.sh b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/run_gpu_vtune.sh new file mode 100644 index 0000000000..fb135230fa --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/run_gpu_vtune.sh @@ -0,0 +1,6 @@ +#!/bin/bash +vtune -run-pass-thru=--no-altstack -collect=gpu-offload -result-dir=vtune_dir -- ./build/src/2_GPU_basic 1024 1024 1024 100 +vtune -run-pass-thru=--no-altstack -collect=gpu-hotspots -result-dir=vtune_dir_hotspots -- ./build/src/2_GPU_basic 1024 1024 1024 100 +vtune -report summary -result-dir vtune_dir -format html -report-output ./reports/output_offload.html +vtune -report summary -result-dir vtune_dir_hotspots -format html -report-output ./reports/output_hotspots.html + diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/src/1_CPU_only.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/src/1_CPU_only.cpp new file mode 100644 index 0000000000..97730a9aec --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/src/1_CPU_only.cpp @@ -0,0 +1,129 @@ +//============================================================== +// Copyright 2022 Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= + +#include +#include +#include + +#include "Utils.hpp" + +void inline iso3dfdIteration(float* ptr_next_base, float* ptr_prev_base, + float* ptr_vel_base, float* coeff, const size_t n1, + const size_t n2, const size_t n3) { + auto dimn1n2 = n1 * n2; + + // Remove 
HALO from the end + auto n3_end = n3 - kHalfLength; + auto n2_end = n2 - kHalfLength; + auto n1_end = n1 - kHalfLength; + + for (auto iz = kHalfLength; iz < n3_end; iz++) { + for (auto iy = kHalfLength; iy < n2_end; iy++) { + // Calculate start pointers for the row over X dimension + float* ptr_next = ptr_next_base + iz * dimn1n2 + iy * n1; + float* ptr_prev = ptr_prev_base + iz * dimn1n2 + iy * n1; + float* ptr_vel = ptr_vel_base + iz * dimn1n2 + iy * n1; + + // Iterate over X + for (auto ix = kHalfLength; ix < n1_end; ix++) { + // Calculate values for each cell + float value = ptr_prev[ix] * coeff[0]; + for (int i = 1; i <= kHalfLength; i++) { + value += + coeff[i] * + (ptr_prev[ix + i] + ptr_prev[ix - i] + + ptr_prev[ix + i * n1] + ptr_prev[ix - i * n1] + + ptr_prev[ix + i * dimn1n2] + ptr_prev[ix - i * dimn1n2]); + } + ptr_next[ix] = 2.0f * ptr_prev[ix] - ptr_next[ix] + value * ptr_vel[ix]; + } + } + } +} + +void iso3dfd(float* next, float* prev, float* vel, float* coeff, + const size_t n1, const size_t n2, const size_t n3, + const size_t nreps) { + for (auto it = 0; it < nreps; it++) { + iso3dfdIteration(next, prev, vel, coeff, n1, n2, n3); + // Swap the pointers for always having current values in prev array + std::swap(next, prev); + } +} + +int main(int argc, char* argv[]) { + // Arrays used to update the wavefield + float* prev; + float* next; + // Array to store wave velocity + float* vel; + + // Variables to store size of grids and number of simulation iterations + size_t n1, n2, n3; + size_t num_iterations; + + if (argc < 5) { + Usage(argv[0]); + return 1; + } + + try { + // Parse command line arguments and increase them by HALO + n1 = std::stoi(argv[1]) + (2 * kHalfLength); + n2 = std::stoi(argv[2]) + (2 * kHalfLength); + n3 = std::stoi(argv[3]) + (2 * kHalfLength); + num_iterations = std::stoi(argv[4]); + } catch (...) 
{ + Usage(argv[0]); + return 1; + } + + // Validate input sizes for the grid + if (ValidateInput(n1, n2, n3, num_iterations)) { + Usage(argv[0]); + return 1; + } + + // Compute the total size of grid + size_t nsize = n1 * n2 * n3; + + prev = new float[nsize]; + next = new float[nsize]; + vel = new float[nsize]; + + // Compute coefficients to be used in wavefield update + float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1, + +7.572087e-2, -1.76767677e-2, +3.480962e-3, + -5.180005e-4, +5.074287e-5, -2.42812e-6}; + + // Apply the DX, DY and DZ to coefficients + coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz); + for (auto i = 1; i <= kHalfLength; i++) { + coeff[i] = coeff[i] / (dxyz * dxyz); + } + + // Initialize arrays and introduce initial conditions (source) + initialize(prev, next, vel, n1, n2, n3); + + std::cout << "Running on CPU serial version\n"; + auto start = std::chrono::steady_clock::now(); + + // Invoke the driver function to perform 3D wave propagation 1 thread serial + // version + iso3dfd(next, prev, vel, coeff, n1, n2, n3, num_iterations); + + auto end = std::chrono::steady_clock::now(); + auto time = std::chrono::duration_cast(end - start) + .count(); + + printStats(time, n1, n2, n3, num_iterations); + + delete[] prev; + delete[] next; + delete[] vel; + + return 0; +} \ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/src/2_GPU_basic.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/src/2_GPU_basic.cpp new file mode 100644 index 0000000000..ae72c0f73b --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/src/2_GPU_basic.cpp @@ -0,0 +1,153 @@ +//============================================================== +// Copyright © Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= + +#include +#include +#include +#include + +#include 
"Utils.hpp" + +using namespace sycl; + +void iso3dfd(queue& q, float* next, float* prev, float* vel, float* coeff, + const size_t n1, const size_t n2, const size_t n3, + const size_t nreps) { + // Create 3D SYCL range for kernels which not include HALO + range<3> kernel_range(n1 - 2 * kHalfLength, n2 - 2 * kHalfLength, + n3 - 2 * kHalfLength); + // Create 3D SYCL range for buffers which include HALO + range<3> buffer_range(n1, n2, n3); + // Create buffers using SYCL class buffer + buffer next_buf(next, buffer_range); + buffer prev_buf(prev, buffer_range); + buffer vel_buf(vel, buffer_range); + buffer coeff_buf(coeff, range(kHalfLength + 1)); + + for (auto it = 0; it < nreps; it += 1) { + // Submit command group for execution + q.submit([&](handler& h) { + // Create accessors + accessor next_acc(next_buf, h); + accessor prev_acc(prev_buf, h); + accessor vel_acc(vel_buf, h, read_only); + accessor coeff_acc(coeff_buf, h, read_only); + + // Send a SYCL kernel(lambda) to the device for parallel execution + // Each kernel runs single cell + h.parallel_for(kernel_range, [=](id<3> idx) { + // Start of device code + // Add offsets to indices to exclude HALO + int i = idx[0] + kHalfLength; + int j = idx[1] + kHalfLength; + int k = idx[2] + kHalfLength; + + // Calculate values for each cell + float value = prev_acc[i][j][k] * coeff_acc[0]; +#pragma unroll(8) + for (int x = 1; x <= kHalfLength; x++) { + value += + coeff_acc[x] * (prev_acc[i][j][k + x] + prev_acc[i][j][k - x] + + prev_acc[i][j + x][k] + prev_acc[i][j - x][k] + + prev_acc[i + x][j][k] + prev_acc[i - x][j][k]); + } + next_acc[i][j][k] = 2.0f * prev_acc[i][j][k] - next_acc[i][j][k] + + value * vel_acc[i][j][k]; + // End of device code + }); + }); + + // Swap the buffers for always having current values in prev buffer + std::swap(next_buf, prev_buf); + } +} + +int main(int argc, char* argv[]) { + // Arrays used to update the wavefield + float* prev; + float* next; + // Array to store wave velocity + float* vel; + + 
// Variables to store size of grids and number of simulation iterations + size_t n1, n2, n3; + size_t num_iterations; + + // Flag to verify results with CPU version + bool verify = false; + + if (argc < 5) { + Usage(argv[0]); + return 1; + } + + try { + // Parse command line arguments and increase them by HALO + n1 = std::stoi(argv[1]) + (2 * kHalfLength); + n2 = std::stoi(argv[2]) + (2 * kHalfLength); + n3 = std::stoi(argv[3]) + (2 * kHalfLength); + num_iterations = std::stoi(argv[4]); + if (argc > 5) verify = true; + } catch (...) { + Usage(argv[0]); + return 1; + } + + // Validate input sizes for the grid + if (ValidateInput(n1, n2, n3, num_iterations)) { + Usage(argv[0]); + return 1; + } + + // Create queue and print target info with default selector and in order + // property + queue q(default_selector_v, {property::queue::in_order()}); + std::cout << " Running GPU basic offload version\n"; + printTargetInfo(q); + + // Compute the total size of grid + size_t nsize = n1 * n2 * n3; + + prev = new float[nsize]; + next = new float[nsize]; + vel = new float[nsize]; + + // Compute coefficients to be used in wavefield update + float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1, + +7.572087e-2, -1.76767677e-2, +3.480962e-3, + -5.180005e-4, +5.074287e-5, -2.42812e-6}; + + // Apply the DX, DY and DZ to coefficients + coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz); + for (auto i = 1; i <= kHalfLength; i++) { + coeff[i] = coeff[i] / (dxyz * dxyz); + } + + // Initialize arrays and introduce initial conditions (source) + initialize(prev, next, vel, n1, n2, n3); + + auto start = std::chrono::steady_clock::now(); + + // Invoke the driver function to perform 3D wave propagation offloaded to + // the device + iso3dfd(q, next, prev, vel, coeff, n1, n2, n3, num_iterations); + + auto end = std::chrono::steady_clock::now(); + auto time = std::chrono::duration_cast(end - start) + .count(); + printStats(time, n1, n2, n3, num_iterations); + + // Verify result with the 
CPU serial version + if (verify) { + VerifyResult(prev, next, vel, coeff, n1, n2, n3, num_iterations); + } + + delete[] prev; + delete[] next; + delete[] vel; + + return 0; +} diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/src/3_GPU_linear.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/src/3_GPU_linear.cpp new file mode 100644 index 0000000000..553b38a47d --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/src/3_GPU_linear.cpp @@ -0,0 +1,157 @@ +//============================================================== +// Copyright 2022 Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= + +#include +#include +#include +#include + +#include "Utils.hpp" + +using namespace sycl; + +void iso3dfd(queue& q, float* next, float* prev, float* vel, float* coeff, + const size_t n1, const size_t n2, const size_t n3, + const size_t nreps) { + // Create 3D SYCL range for kernels which not include HALO + range<3> kernel_range(n1 - 2 * kHalfLength, n2 - 2 * kHalfLength, + n3 - 2 * kHalfLength); + // Create 1D SYCL range for buffers which include HALO + range<1> buffer_range(n1 * n2 * n3); + // Create buffers using SYCL class buffer + buffer next_buf(next, buffer_range); + buffer prev_buf(prev, buffer_range); + buffer vel_buf(vel, buffer_range); + buffer coeff_buf(coeff, range(kHalfLength + 1)); + + for (auto it = 0; it < nreps; it++) { + // Submit command group for execution + q.submit([&](handler& h) { + // Create accessors + accessor next_acc(next_buf, h); + accessor prev_acc(prev_buf, h); + accessor vel_acc(vel_buf, h, read_only); + accessor coeff_acc(coeff_buf, h, read_only); + + // Send a SYCL kernel(lambda) to the device for parallel execution + // Each kernel runs single cell + h.parallel_for(kernel_range, [=](id<3> nidx) { + // Start of device code + // Add offsets to indices to exclude HALO 
+ int n2n3 = n2 * n3; + int i = nidx[0] + kHalfLength; + int j = nidx[1] + kHalfLength; + int k = nidx[2] + kHalfLength; + + // Calculate linear index for each cell + int idx = i * n2n3 + j * n3 + k; + + // Calculate values for each cell + float value = prev_acc[idx] * coeff_acc[0]; +#pragma unroll(8) + for (int x = 1; x <= kHalfLength; x++) { + value += + coeff_acc[x] * (prev_acc[idx + x] + prev_acc[idx - x] + + prev_acc[idx + x * n3] + prev_acc[idx - x * n3] + + prev_acc[idx + x * n2n3] + prev_acc[idx - x * n2n3]); + } + next_acc[idx] = 2.0f * prev_acc[idx] - next_acc[idx] + + value * vel_acc[idx]; + // End of device code + }); + }); + + // Swap the buffers for always having current values in prev buffer + std::swap(next_buf, prev_buf); + } +} + +int main(int argc, char* argv[]) { + // Arrays used to update the wavefield + float* prev; + float* next; + // Array to store wave velocity + float* vel; + + // Variables to store size of grids and number of simulation iterations + size_t n1, n2, n3; + size_t num_iterations; + + // Flag to verify results with CPU version + bool verify = false; + + if (argc < 5) { + Usage(argv[0]); + return 1; + } + + try { + // Parse command line arguments and increase them by HALO + n1 = std::stoi(argv[1]) + (2 * kHalfLength); + n2 = std::stoi(argv[2]) + (2 * kHalfLength); + n3 = std::stoi(argv[3]) + (2 * kHalfLength); + num_iterations = std::stoi(argv[4]); + if (argc > 5) verify = true; + } catch (...) 
{ + Usage(argv[0]); + return 1; + } + + // Validate input sizes for the grid + if (ValidateInput(n1, n2, n3, num_iterations)) { + Usage(argv[0]); + return 1; + } + + // Create queue and print target info with default selector and in order + // property + queue q(default_selector_v, {property::queue::in_order()}); + std::cout << " Running linear indexed GPU version\n"; + printTargetInfo(q); + + // Compute the total size of grid + size_t nsize = n1 * n2 * n3; + + prev = new float[nsize]; + next = new float[nsize]; + vel = new float[nsize]; + + // Compute coefficients to be used in wavefield update + float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1, + +7.572087e-2, -1.76767677e-2, +3.480962e-3, + -5.180005e-4, +5.074287e-5, -2.42812e-6}; + + // Apply the DX, DY and DZ to coefficients + coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz); + for (auto i = 1; i <= kHalfLength; i++) { + coeff[i] = coeff[i] / (dxyz * dxyz); + } + + // Initialize arrays and introduce initial conditions (source) + initialize(prev, next, vel, n1, n2, n3); + + auto start = std::chrono::steady_clock::now(); + + // Invoke the driver function to perform 3D wave propagation offloaded to + // the device + iso3dfd(q, next, prev, vel, coeff, n1, n2, n3, num_iterations); + + auto end = std::chrono::steady_clock::now(); + auto time = std::chrono::duration_cast(end - start) + .count(); + printStats(time, n1, n2, n3, num_iterations); + + // Verify result with the CPU serial version + if (verify) { + VerifyResult(prev, next, vel, coeff, n1, n2, n3, num_iterations); + } + + delete[] prev; + delete[] next; + delete[] vel; + + return 0; +} \ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/src/4_GPU_optimized.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/src/4_GPU_optimized.cpp new file mode 100644 index 0000000000..99dd9d85b8 --- /dev/null +++ 
b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/src/4_GPU_optimized.cpp @@ -0,0 +1,171 @@ +//============================================================== +// Copyright © Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= +#include +#include +#include +#include + +#include "Utils.hpp" + +using namespace sycl; + +void iso3dfd(queue& q, float* ptr_next, float* ptr_prev, float* ptr_vel, float* ptr_coeff, + const size_t n1, const size_t n2, const size_t n3,size_t n1_block, size_t n2_block, size_t n3_block, + const size_t nIterations) { + auto nx = n1; + auto nxy = n1*n2; + auto grid_size = nxy*n3; + + auto b1 = kHalfLength; + auto b2 = kHalfLength; + auto b3 = kHalfLength; + + auto next = sycl::aligned_alloc_device(64, grid_size + 16, q); + next += (16 - b1); + q.memcpy(next, ptr_next, sizeof(float)*grid_size); + auto prev = sycl::aligned_alloc_device(64, grid_size + 16, q); + prev += (16 - b1); + q.memcpy(prev, ptr_prev, sizeof(float)*grid_size); + auto vel = sycl::aligned_alloc_device(64, grid_size + 16, q); + vel += (16 - b1); + q.memcpy(vel, ptr_vel, sizeof(float)*grid_size); + //auto coeff = sycl::aligned_alloc_device(64, grid_size + 16, q); + auto coeff = sycl::aligned_alloc_device(64, kHalfLength+1 , q); + q.memcpy(coeff, ptr_coeff, sizeof(float)*(kHalfLength+1)); + //coeff += (16 - b1); + //q.memcpy(coeff, coeff, sizeof(float)*grid_size); + q.wait(); + + //auto local_nd_range = range(1, n2_block, n1_block); + //auto global_nd_range = range((n3 - 2 * kHalfLength)/n3_block, (n2 - 2 * kHalfLength)/n2_block, + //(n1 - 2 * kHalfLength)); + + auto local_nd_range = range<3>(n3_block,n2_block,n1_block); + auto global_nd_range = range<3>((n3-2*b3+n3_block-1)/n3_block*n3_block,(n2-2*b2+n2_block-1)/n2_block*n2_block,n1_block); + + + for (auto i = 0; i < nIterations; i += 1) { + q.submit([&](auto &h) { + h.parallel_for( + nd_range(global_nd_range, local_nd_range), [=](auto 
item) + //[[intel::reqd_sub_group_size(32)]] + //[[intel::kernel_args_restrict]] + { + const int iz = b3 + item.get_global_id(0); + const int iy = b2 + item.get_global_id(1); + if (iz < n3 - b3 && iy < n2 - b2) + for (int ix = b1+item.get_global_id(2); ix < n1 - b1; ix += n1_block) + { + auto gid = ix + iy*nx + iz*nxy; + float *pgid = prev+gid; + auto value = coeff[0] * pgid[0]; +#pragma unroll(kHalfLength) + for (auto iter = 1; iter <= kHalfLength; iter++) + value += coeff[iter]*(pgid[iter*nxy] + pgid[-iter*nxy] + pgid[iter*nx] + pgid[-iter*nx] + pgid[iter] + pgid[-iter]); + next[gid] = 2.0f*pgid[0] - next[gid] + value*vel[gid]; + } + }); + }).wait(); + std::swap(next, prev); + } + q.memcpy(ptr_prev, prev, sizeof(float)*grid_size); + + sycl::free(next - (16 - b1),q); + sycl::free(prev - (16 - b1),q); + sycl::free(vel - (16 - b1),q); + sycl::free(coeff,q); + +} + +int main(int argc, char* argv[]) { + // Arrays used to update the wavefield + float* prev; + float* next; + // Array to store wave velocity + float* vel; + + // Variables to store size of grids and number of simulation iterations + size_t n1, n2, n3; + size_t n1_block, n2_block, n3_block; + size_t num_iterations; + + // Flag to verify results with CPU version + bool verify = false; + + if (argc < 5) { + Usage(argv[0]); + return 1; + } + + try { + // Parse command line arguments and increase them by HALO + n1 = std::stoi(argv[1]) + (2 * kHalfLength); + n2 = std::stoi(argv[2]) + (2 * kHalfLength); + n3 = std::stoi(argv[3]) + (2 * kHalfLength); + n1_block = std::stoi(argv[4]); + n2_block = std::stoi(argv[5]); + n3_block = std::stoi(argv[6]); + num_iterations = std::stoi(argv[7]); + } catch (...) 
{ + Usage(argv[0]); + return 1; + } + + // Validate input sizes for the grid + if (ValidateInput(n1, n2, n3, num_iterations)) { + Usage(argv[0]); + return 1; + } + + // Create queue and print target info with default selector and in order + // property + queue q(default_selector_v, {property::queue::in_order()}); + std::cout << " Running optimized GPU version\n"; + printTargetInfo(q); + + // Compute the total size of grid + size_t nsize = n1 * n2 * n3; + + prev = new float[nsize]; + next = new float[nsize]; + vel = new float[nsize]; + + // Compute coefficients to be used in wavefield update + float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1, + +7.572087e-2, -1.76767677e-2, +3.480962e-3, + -5.180005e-4, +5.074287e-5, -2.42812e-6}; + + // Apply the DX, DY and DZ to coefficients + coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz); + for (auto i = 1; i <= kHalfLength; i++) { + coeff[i] = coeff[i] / (dxyz * dxyz); + } + + // Initialize arrays and introduce initial conditions (source) + initialize(prev, next, vel, n1, n2, n3); + + auto start = std::chrono::steady_clock::now(); + + // Invoke the driver function to perform 3D wave propagation offloaded to + // the device + iso3dfd(q, next, prev, vel, coeff, n1, n2, n3,n1_block,n2_block,n3_block, num_iterations); + + auto end = std::chrono::steady_clock::now(); + auto time = std::chrono::duration_cast(end - start) + .count(); + printStats(time, n1, n2, n3, num_iterations); + + // Verify result with the CPU serial version + if (verify) { + VerifyResult(prev, next, vel, coeff, n1, n2, n3, num_iterations); + } + + delete[] prev; + delete[] next; + delete[] vel; + + return 0; +} diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/src/CMakeLists.txt b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/src/CMakeLists.txt new file mode 100644 index 0000000000..69872e9a1e --- /dev/null +++ 
b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/src/CMakeLists.txt @@ -0,0 +1,30 @@ +set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -fsycl --std=c++17") +# Set default build type to RelWithDebInfo if not specified +if (NOT CMAKE_BUILD_TYPE) + message (STATUS "Default CMAKE_BUILD_TYPE not set; using RelWithDebInfo") + set (CMAKE_BUILD_TYPE "RelWithDebInfo" CACHE + STRING "Choose the type of build, options are: None Debug Release RelWithDebInfo MinSizeRel" + FORCE) +endif() + +set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS}") + +add_executable(1_CPU_only 1_CPU_only.cpp) +add_executable(2_GPU_basic 2_GPU_basic.cpp) +add_executable(3_GPU_linear 3_GPU_linear.cpp) +add_executable(4_GPU_optimized 4_GPU_optimized.cpp) + +target_link_libraries(1_CPU_only OpenCL sycl) +target_link_libraries(2_GPU_basic OpenCL sycl) +target_link_libraries(3_GPU_linear OpenCL sycl) +target_link_libraries(4_GPU_optimized OpenCL sycl) + +add_custom_target(run_all 1_CPU_only 1024 1024 1024 100 + COMMAND 2_GPU_basic 1024 1024 1024 100 + COMMAND 3_GPU_linear 1024 1024 1024 100 + COMMAND 4_GPU_optimized 1024 1024 1024 32 4 8 100) +add_custom_target(run_cpu 1_CPU_only 1024 1024 1024 100) +add_custom_target(run_gpu_basic 2_GPU_basic 1024 1024 1024 100) +add_custom_target(run_gpu_linear 3_GPU_linear 1024 1024 1024 100) +add_custom_target(run_gpu_optimized 4_GPU_optimized 1024 1024 1024 32 4 8 100) + diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/src/Iso3dfd.hpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/src/Iso3dfd.hpp new file mode 100644 index 0000000000..e3487fa0cf --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/src/Iso3dfd.hpp @@ -0,0 +1,21 @@ +//============================================================== +// Copyright © 2022 Intel Corporation +// +// SPDX-License-Identifier: MIT +// 
============================================================= + +#pragma once + +constexpr size_t kHalfLength = 8; +constexpr float dxyz = 50.0f; +constexpr float dt = 0.002f; + +#define STENCIL_LOOKUP(ir) \ + (coeff[ir] * ((ptr_prev[ix + ir] + ptr_prev[ix - ir]) + \ + (ptr_prev[ix + ir * n1] + ptr_prev[ix - ir * n1]) + \ + (ptr_prev[ix + ir * dimn1n2] + ptr_prev[ix - ir * dimn1n2]))) + + +#define KERNEL_STENCIL_LOOKUP(x) \ + coeff[x] * (tab[l_idx + x] + tab[l_idx - x] + front[x] + back[x - 1] \ + + tab[l_idx + l_n3 * x] + tab[l_idx - l_n3 * x]) \ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/src/Utils.hpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/src/Utils.hpp new file mode 100644 index 0000000000..98d4a6e12c --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/02_ISO3DFD_GPU_Basic/src/Utils.hpp @@ -0,0 +1,259 @@ +//============================================================== +// Copyright © 2022 Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= + +#pragma once + +#include +#include + +#include "Iso3dfd.hpp" + +void Usage(const std::string& programName, bool usedNd_ranges = false) { + std::cout << "--------------------------------------\n"; + std::cout << " Incorrect parameters \n"; + std::cout << " Usage: "; + std::cout << programName << " n1 n2 n3 Iterations"; + + if (usedNd_ranges) std::cout << " kernel_iterations n2_WGS n3_WGS"; + + std::cout << " [verify]\n\n"; + std::cout << " n1 n2 n3 : Grid sizes for the stencil\n"; + std::cout << " Iterations : No. of timesteps.\n"; + + if (usedNd_ranges) { + std::cout + << " kernel_iterations : No. 
of cells calculated by one kernel\n"; + std::cout << " n2_WGS n3_WGS : n2 and n3 work group sizes\n"; + } + std::cout + << " [verify] : Optional: Compare results with CPU version\n"; + std::cout << "--------------------------------------\n"; + std::cout << "--------------------------------------\n"; +} + +bool ValidateInput(size_t n1, size_t n2, size_t n3, size_t num_iterations, + size_t kernel_iterations = -1, size_t n2_WGS = kHalfLength, + size_t n3_WGS = kHalfLength) { + if ((n1 < kHalfLength) || (n2 < kHalfLength) || (n3 < kHalfLength) || + (n2_WGS < kHalfLength) || (n3_WGS < kHalfLength)) { + std::cout << "--------------------------------------\n"; + std::cout << " Invalid grid size : n1, n2, n3, n2_WGS, n3_WGS should be " + "greater than " + << kHalfLength << "\n"; + return true; + } + + if ((n2 < n2_WGS) || (n3 < n3_WGS)) { + std::cout << "--------------------------------------\n"; + std::cout << " Invalid work group size : n2 should be greater than n2_WGS " + "and n3 greater than n3_WGS\n"; + return true; + } + + if (((n2 - 2 * kHalfLength) % n2_WGS) && kernel_iterations != -1) { + std::cout << "--------------------------------------\n"; + std::cout << " ERROR: Invalid Grid Size: n2 should be multiple of n2_WGS - " + << n2_WGS << "\n"; + return true; + } + if (((n3 - 2 * kHalfLength) % n3_WGS) && kernel_iterations != -1) { + std::cout << "--------------------------------------\n"; + std::cout << " ERROR: Invalid Grid Size: n3 should be multiple of n3_WGS - " + << n3_WGS << "\n"; + return true; + } + if (((n1 - 2 * kHalfLength) % kernel_iterations) && kernel_iterations != -1) { + std::cout << "--------------------------------------\n"; + std::cout << " ERROR: Invalid Grid Size: n1 should be multiple of " + "kernel_iterations - " + << kernel_iterations << "\n"; + return true; + } + + return false; +} + +bool CheckWorkGroupSize(sycl::queue& q, unsigned int n2_WGS, + unsigned int n3_WGS) { + auto device = q.get_device(); + auto max_block_size = + 
device.get_info(); + + if ((max_block_size > 1) && (n2_WGS * n3_WGS > max_block_size)) { + std::cout << "ERROR: Invalid block sizes: n2_WGS * n3_WGS should be " + "less than or equal to " + << max_block_size << "\n"; + return true; + } + + return false; +} + +void printTargetInfo(sycl::queue& q) { + auto device = q.get_device(); + auto max_block_size = + device.get_info(); + + auto max_exec_unit_count = + device.get_info(); + + std::cout << " Running on " << device.get_info() + << "\n"; + std::cout << " The Device Max Work Group Size is : " << max_block_size + << "\n"; + std::cout << " The Device Max EUCount is : " << max_exec_unit_count << "\n"; +} + +void initialize(float* ptr_prev, float* ptr_next, float* ptr_vel, size_t n1, + size_t n2, size_t n3) { + auto dim2 = n2 * n1; + + for (auto i = 0; i < n3; i++) { + for (auto j = 0; j < n2; j++) { + auto offset = i * dim2 + j * n1; + + for (auto k = 0; k < n1; k++) { + ptr_prev[offset + k] = 0.0f; + ptr_next[offset + k] = 0.0f; + ptr_vel[offset + k] = + 2250000.0f * dt * dt; // Integration of the v*v and dt*dt here + } + } + } + // Then we add a source + float val = 1.f; + for (auto s = 5; s >= 0; s--) { + for (auto i = n3 / 2 - s; i < n3 / 2 + s; i++) { + for (auto j = n2 / 4 - s; j < n2 / 4 + s; j++) { + auto offset = i * dim2 + j * n1; + for (auto k = n1 / 4 - s; k < n1 / 4 + s; k++) { + ptr_prev[offset + k] = val; + } + } + } + val *= 10; + } +} + +void printStats(double time, size_t n1, size_t n2, size_t n3, + size_t num_iterations) { + float throughput_mpoints = 0.0f, mflops = 0.0f, normalized_time = 0.0f; + double mbytes = 0.0f; + + normalized_time = (double)time / num_iterations; + throughput_mpoints = ((n1 - 2 * kHalfLength) * (n2 - 2 * kHalfLength) * + (n3 - 2 * kHalfLength)) / + (normalized_time * 1e3f); + mflops = (7.0f * kHalfLength + 5.0f) * throughput_mpoints; + mbytes = 12.0f * throughput_mpoints; + + std::cout << "--------------------------------------\n"; + std::cout << "time : " << time / 1e3f << " 
secs\n"; + std::cout << "throughput : " << throughput_mpoints << " Mpts/s\n"; + std::cout << "flops : " << mflops / 1e3f << " GFlops\n"; + std::cout << "bytes : " << mbytes / 1e3f << " GBytes/s\n"; + std::cout << "\n--------------------------------------\n"; + std::cout << "\n--------------------------------------\n"; +} + +bool WithinEpsilon(float* output, float* reference, const size_t dim_x, + const size_t dim_y, const size_t dim_z, + const unsigned int radius, const int zadjust = 0, + const float delta = 0.01f) { + std::ofstream error_file; + error_file.open("error_diff.txt"); + + bool error = false; + double norm2 = 0; + + for (size_t iz = 0; iz < dim_z; iz++) { + for (size_t iy = 0; iy < dim_y; iy++) { + for (size_t ix = 0; ix < dim_x; ix++) { + if (ix >= radius && ix < (dim_x - radius) && iy >= radius && + iy < (dim_y - radius) && iz >= radius && + iz < (dim_z - radius + zadjust)) { + float difference = fabsf(*reference - *output); + norm2 += difference * difference; + if (difference > delta) { + error = true; + error_file << " ERROR: " << ix << ", " << iy << ", " << iz << " " + << *output << " instead of " << *reference + << " (|e|=" << difference << ")\n"; + } + } + ++output; + ++reference; + } + } + } + + error_file.close(); + norm2 = sqrt(norm2); + if (error) std::cout << "error (Euclidean norm): " << norm2 << "\n"; + return error; +} + +void inline iso3dfdCPUIteration(float* ptr_next_base, float* ptr_prev_base, + float* ptr_vel_base, float* coeff, + const size_t n1, const size_t n2, + const size_t n3) { + auto dimn1n2 = n1 * n2; + + auto n3_end = n3 - kHalfLength; + auto n2_end = n2 - kHalfLength; + auto n1_end = n1 - kHalfLength; + + for (auto iz = kHalfLength; iz < n3_end; iz++) { + for (auto iy = kHalfLength; iy < n2_end; iy++) { + float* ptr_next = ptr_next_base + iz * dimn1n2 + iy * n1; + float* ptr_prev = ptr_prev_base + iz * dimn1n2 + iy * n1; + float* ptr_vel = ptr_vel_base + iz * dimn1n2 + iy * n1; + + for (auto ix = kHalfLength; ix < n1_end; 
ix++) { + float value = ptr_prev[ix] * coeff[0]; + value += STENCIL_LOOKUP(1); + value += STENCIL_LOOKUP(2); + value += STENCIL_LOOKUP(3); + value += STENCIL_LOOKUP(4); + value += STENCIL_LOOKUP(5); + value += STENCIL_LOOKUP(6); + value += STENCIL_LOOKUP(7); + value += STENCIL_LOOKUP(8); + + ptr_next[ix] = 2.0f * ptr_prev[ix] - ptr_next[ix] + value * ptr_vel[ix]; + } + } + } +} + +void CalculateReference(float* next, float* prev, float* vel, float* coeff, + const size_t n1, const size_t n2, const size_t n3, + const size_t nreps) { + for (auto it = 0; it < nreps; it += 1) { + iso3dfdCPUIteration(next, prev, vel, coeff, n1, n2, n3); + std::swap(next, prev); + } +} + +void VerifyResult(float* prev, float* next, float* vel, float* coeff, + const size_t n1, const size_t n2, const size_t n3, + const size_t nreps) { + std::cout << "Running CPU version for result comparison: "; + auto nsize = n1 * n2 * n3; + float* temp = new float[nsize]; + memcpy(temp, prev, nsize * sizeof(float)); + initialize(prev, next, vel, n1, n2, n3); + CalculateReference(next, prev, vel, coeff, n1, n2, n3, nreps); + bool error = WithinEpsilon(temp, prev, n1, n2, n3, kHalfLength, 0, 0.1f); + if (error) { + std::cout << "Final wavefields from SYCL device and CPU are not " + << "equivalent: Fail\n"; + } else { + std::cout << "Final wavefields from SYCL device and CPU are equivalent:" + << " Success\n"; + } + delete[] temp; +} diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/CMakeLists.txt b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/CMakeLists.txt new file mode 100644 index 0000000000..e0bded3dae --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/CMakeLists.txt @@ -0,0 +1,4 @@ +cmake_minimum_required (VERSION 3.4) +set(CMAKE_CXX_COMPILER "icpx") +project (Iso3DFD) +add_subdirectory (src) \ No newline at end of file diff --git 
a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/img/gpu_linear.png b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/img/gpu_linear.png new file mode 100644 index 0000000000..04eee494c5 Binary files /dev/null and b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/img/gpu_linear.png differ diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/img/roofline2.png b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/img/roofline2.png new file mode 100644 index 0000000000..8f142cfc31 Binary files /dev/null and b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/img/roofline2.png differ diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/iso3dfd_gpu_linear.ipynb b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/iso3dfd_gpu_linear.ipynb new file mode 100644 index 0000000000..5e63ad982a --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/iso3dfd_gpu_linear.ipynb @@ -0,0 +1,748 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# ISO3DFD on a GPU and Index computations" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Learning Objectives" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
- Understand how to address the application being compute bound by reducing index calculations\n", + "- Run roofline analysis and the VTune reports again to gauge the results and look for additional opportunities" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Iso3DFD reducing the index calculations" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the previous activity, we used Intel® Advisor roofline analysis to determine that the application is compute bound: the kernels have high arithmetic intensity, and we are bound by INT operations, which come almost entirely from index computations.\n", + "What we need to do is provide the kernel with the right index (the offset in the original code). SYCL provides this information through an iterator that the runtime passes to the kernel function. This iterator identifies the position of the current iteration in the 3D space; it can be accessed in 3 dimensions by calling it.get_global_id(0), it.get_global_id(1), and it.get_global_id(2).\n", + "\n", + "In this notebook, we'll address the kernels being compute bound by changing how we calculate indices, reducing the index calculations." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Optimizing the Indexing of the Iso3DFD application\n", + "The 3_GPU_linear version of the sample implements the index calculation optimization, where we change the 3D indexing to 1D. 
We need to flatten the buffers, change how we calculate the location in memory for each kernel, and change how we access the neighbors.\n", + "* For the index calculation optimization, we need to change the 3D indexing to 1D, which also requires flattening the buffers\n", + "\n", + "```\n", + "// Create 1D SYCL range for buffers which include HALO\n", + "range<1> buffer_range(n1 * n2 * n3);\n", + "// Create buffers using SYCL class buffer\n", + "buffer next_buf(ptr_next, buffer_range);\n", + "buffer prev_buf(ptr_prev, buffer_range);\n", + "buffer vel_buf(ptr_vel, buffer_range);\n", + "buffer coeff_buf(ptr_coeff, range(kHalfLength + 1));\n", + "```\n", + "\n", + "* We change how we calculate the location in memory for each kernel\n", + "\n", + "```\n", + "// Start of device code\n", + "// Add offsets to indices to exclude HALO\n", + "int n2n3 = n2 * n3;\n", + "int i = nidx[0] + kHalfLength;\n", + "int j = nidx[1] + kHalfLength;\n", + "int k = nidx[2] + kHalfLength;\n", + "\n", + "// Calculate linear index for each cell\n", + "int idx = i * n2n3 + j * n3 + k;\n", + "\n", + "```\n", + "* We change how we access the neighbors\n", + "\n", + "```\n", + "// Calculate values for each cell\n", + " float value = prev_acc[idx] * coeff_acc[0];\n", + "#pragma unroll(8)\n", + " for (int x = 1; x <= kHalfLength; x++) {\n", + " value +=\n", + " coeff_acc[x] * (prev_acc[idx + x] + prev_acc[idx - x] +\n", + " prev_acc[idx + x * n3] + prev_acc[idx - x * n3] +\n", + " prev_acc[idx + x * n2n3] + prev_acc[idx - x * n2n3]);\n", + " }\n", + " next_acc[idx] = 2.0f * prev_acc[idx] - next_acc[idx] +\n", + " value * vel_acc[idx];\n", + "// End of device code\n", + "});\n", + "});\n", + "\n", + "```\n", + "We will run roofline analysis and the VTune reports again to gauge the results and look for additional optimization opportunities based on 3_GPU_linear.\n", + "\n", + "The SYCL code below shows the Iso3DFD GPU code using SYCL with the index optimizations. Inspect the code; there are no modifications 
necessary:\n", + "1. Inspect the code cell below and click run ▶ to save the code to file\n", + "2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%writefile src/3_GPU_linear_USM.cpp\n", + "//==============================================================\n", + "// Copyright © Intel Corporation\n", + "//\n", + "// SPDX-License-Identifier: MIT\n", + "// =============================================================\n", + "#include <sycl/sycl.hpp>\n", + "#include <chrono>\n", + "#include <string>\n", + "#include <fstream>\n", + "\n", + "#include \"Utils.hpp\"\n", + "\n", + "using namespace sycl;\n", + "\n", + "bool iso3dfd(sycl::queue &q, float *ptr_next, float *ptr_prev,\n", + " float *ptr_vel, float *ptr_coeff, size_t n1, size_t n2,\n", + " size_t n3, unsigned int nIterations) {\n", + " auto nx = n1;\n", + " auto nxy = n1*n2;\n", + " auto grid_size = n1*n2*n3;\n", + " auto b1 = kHalfLength;\n", + " auto b2 = kHalfLength;\n", + " auto b3 = kHalfLength;\n", + " \n", + " // Create 3D SYCL range for kernels, which does not include HALO\n", + " range<3> kernel_range(n1 - 2 * kHalfLength, n2 - 2 * kHalfLength,\n", + " n3 - 2 * kHalfLength);\n", + "\n", + " auto next = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);\n", + " next += (16 - b1);\n", + " q.memcpy(next, ptr_next, sizeof(float)*grid_size);\n", + " auto prev = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);\n", + " prev += (16 - b1);\n", + " q.memcpy(prev, ptr_prev, sizeof(float)*grid_size);\n", + " auto vel = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);\n", + " vel += (16 - b1);\n", + " q.memcpy(vel, ptr_vel, sizeof(float)*grid_size);\n", + " auto coeff = sycl::aligned_alloc_device<float>(64, kHalfLength + 1, q);\n", + " //coeff += (16 - b1);\n", + " q.memcpy(coeff, ptr_coeff, sizeof(float)*(kHalfLength+1));\n", + " q.wait();\n", + "\n", + " for (auto it = 0; it < nIterations; it += 1) 
{\n", + " // Submit command group for execution\n", + " q.submit([&](handler& h) {\n", + " // Send a SYCL kernel(lambda) to the device for parallel execution\n", + " // Each kernel runs single cell\n", + " h.parallel_for(kernel_range, [=](id<3> idx) {\n", + " // Start of device code\n", + " // Add offsets to indices to exclude HALO\n", + " int n2n3 = n2 * n3;\n", + " int i = idx[0] + kHalfLength;\n", + " int j = idx[1] + kHalfLength;\n", + " int k = idx[2] + kHalfLength;\n", + "\n", + " // Calculate linear index for each cell\n", + " int gid = i * n2n3 + j * n3 + k;\n", + " auto value = coeff[0] * prev[gid];\n", + " \n", + " // Calculate values for each cell\n", + "#pragma unroll(8)\n", + " for (int x = 1; x <= kHalfLength; x++) {\n", + " value += coeff[x] * (prev[gid + x] + prev[gid - x] +\n", + " prev[gid + x * n3] + prev[gid - x * n3] +\n", + " prev[gid + x * n2n3] + prev[gid - x * n2n3]);\n", + " }\n", + " next[gid] = 2.0f * prev[gid] - next[gid] + value * vel[gid];\n", + " \n", + " // End of device code\n", + " });\n", + " }).wait();\n", + "\n", + " // Swap the buffers for always having current values in prev buffer\n", + " std::swap(next, prev);\n", + " }\n", + " q.memcpy(ptr_prev, prev, sizeof(float)*grid_size);\n", + "\n", + " sycl::free(next - (16 - b1),q);\n", + " sycl::free(prev - (16 - b1),q);\n", + " sycl::free(vel - (16 - b1),q);\n", + " sycl::free(coeff,q);\n", + " return true;\n", + "}\n", + "\n", + "int main(int argc, char* argv[]) {\n", + " // Arrays used to update the wavefield\n", + " float* prev;\n", + " float* next;\n", + " // Array to store wave velocity\n", + " float* vel;\n", + "\n", + " // Variables to store size of grids and number of simulation iterations\n", + " size_t n1, n2, n3;\n", + " size_t num_iterations;\n", + "\n", + " // Flag to verify results with CPU version\n", + " bool verify = false;\n", + "\n", + " if (argc < 5) {\n", + " Usage(argv[0]);\n", + " return 1;\n", + " }\n", + "\n", + " try {\n", + " // Parse command line 
arguments and increase them by HALO\n", + " n1 = std::stoi(argv[1]) + (2 * kHalfLength);\n", + " n2 = std::stoi(argv[2]) + (2 * kHalfLength);\n", + " n3 = std::stoi(argv[3]) + (2 * kHalfLength);\n", + " num_iterations = std::stoi(argv[4]);\n", + " if (argc > 5) verify = true;\n", + " } catch (...) {\n", + " Usage(argv[0]);\n", + " return 1;\n", + " }\n", + "\n", + " // Validate input sizes for the grid\n", + " if (ValidateInput(n1, n2, n3, num_iterations)) {\n", + " Usage(argv[0]);\n", + " return 1;\n", + " }\n", + "\n", + " // Create queue and print target info with default selector and in order\n", + " // property\n", + " queue q(default_selector_v, {property::queue::in_order()});\n", + " std::cout << \" Running linear indexed GPU version\\n\";\n", + " printTargetInfo(q);\n", + "\n", + " // Compute the total size of grid\n", + " size_t nsize = n1 * n2 * n3;\n", + "\n", + " prev = new float[nsize];\n", + " next = new float[nsize];\n", + " vel = new float[nsize];\n", + "\n", + " // Compute coefficients to be used in wavefield update\n", + " float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1,\n", + " +7.572087e-2, -1.76767677e-2, +3.480962e-3,\n", + " -5.180005e-4, +5.074287e-5, -2.42812e-6};\n", + "\n", + " // Apply the DX, DY and DZ to coefficients\n", + " coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz);\n", + " for (auto i = 1; i <= kHalfLength; i++) {\n", + " coeff[i] = coeff[i] / (dxyz * dxyz);\n", + " }\n", + "\n", + " // Initialize arrays and introduce initial conditions (source)\n", + " initialize(prev, next, vel, n1, n2, n3);\n", + "\n", + " auto start = std::chrono::steady_clock::now();\n", + "\n", + " // Invoke the driver function to perform 3D wave propagation offloaded to\n", + " // the device\n", + " iso3dfd(q, next, prev, vel, coeff, n1, n2, n3, num_iterations);\n", + "\n", + " auto end = std::chrono::steady_clock::now();\n", + " auto time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start)\n", + " .count();\n", + " printStats(time, n1, n2, n3, 
num_iterations);\n", + "\n", + " // Verify result with the CPU serial version\n", + " if (verify) {\n", + " VerifyResult(prev, next, vel, coeff, n1, n2, n3, num_iterations);\n", + " }\n", + "\n", + " delete[] prev;\n", + " delete[] next;\n", + " delete[] vel;\n", + "\n", + " return 0;\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once the application is built, we can run it from the command line with a few parameters, as follows:\n", + "src/3_GPU_linear 1024 1024 1024 100\n", + "- bin/3_GPU_linear is the binary\n", + "- 1024 1024 1024 are the sizes of the 3 dimensions; increasing them results in more computation time\n", + "- 100 is the number of time steps
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Build and Run\n", + "Select the cell below and click run ▶ to compile and execute the code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "! chmod 755 q; chmod 755 run_gpu_linear_usm.sh;if [ -x \"$(command -v qsub)\" ]; then ./q run_gpu_linear_usm.sh; else ./run_gpu_linear_usm.sh; fi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## ISO3DFD Linear using Buffers and Accessors\n", + "\n", + "### Build and Run\n", + "Select the cell below and click run ▶ to compile and execute the code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%writefile src/3_GPU_linear.cpp\n", + "//==============================================================\n", + "// Copyright © Intel Corporation\n", + "//\n", + "// SPDX-License-Identifier: MIT\n", + "// =============================================================\n", + "#include \n", + "#include \n", + "#include \n", + "#include \n", + "\n", + "#include \"Utils.hpp\"\n", + "\n", + "using namespace sycl;\n", + "\n", + "void iso3dfd(sycl::queue &q, float *ptr_next, float *ptr_prev,\n", + " float *ptr_vel, float *ptr_coeff, size_t n1, size_t n2,\n", + " size_t n3, size_t n1_block, size_t n2_block, size_t n3_block,\n", + " size_t end_z, unsigned int nIterations) {\n", + " // Create 3D SYCL range for kernels which not include HALO\n", + " range<3> kernel_range(n1 - 2 * kHalfLength, n2 - 2 * kHalfLength,\n", + " n3 - 2 * kHalfLength);\n", + " // Create 1D SYCL range for buffers which include HALO\n", + " range<1> buffer_range(n1 * n2 * n3);\n", + " // Create buffers using SYCL class buffer\n", + " buffer next_buf(next, buffer_range);\n", + " buffer prev_buf(prev, buffer_range);\n", + " buffer vel_buf(vel, buffer_range);\n", + " buffer coeff_buf(coeff, range(kHalfLength + 1));\n", + "\n", + " for 
(auto it = 0; it < nreps; it++) {\n", + " // Submit command group for execution\n", + " q.submit([&](handler& h) {\n", + " // Create accessors\n", + " accessor next_acc(next_buf, h);\n", + " accessor prev_acc(prev_buf, h);\n", + " accessor vel_acc(vel_buf, h, read_only);\n", + " accessor coeff_acc(coeff_buf, h, read_only);\n", + "\n", + " // Send a SYCL kernel(lambda) to the device for parallel execution\n", + " // Each kernel runs single cell\n", + " h.parallel_for(kernel_range, [=](id<3> nidx) {\n", + " // Start of device code\n", + " // Add offsets to indices to exclude HALO\n", + " int n2n3 = n2 * n3;\n", + " int i = nidx[0] + kHalfLength;\n", + " int j = nidx[1] + kHalfLength;\n", + " int k = nidx[2] + kHalfLength;\n", + "\n", + " // Calculate linear index for each cell\n", + " int idx = i * n2n3 + j * n3 + k;\n", + "\n", + " // Calculate values for each cell\n", + " float value = prev_acc[idx] * coeff_acc[0];\n", + "#pragma unroll(8)\n", + " for (int x = 1; x <= kHalfLength; x++) {\n", + " value +=\n", + " coeff_acc[x] * (prev_acc[idx + x] + prev_acc[idx - x] +\n", + " prev_acc[idx + x * n3] + prev_acc[idx - x * n3] +\n", + " prev_acc[idx + x * n2n3] + prev_acc[idx - x * n2n3]);\n", + " }\n", + " next_acc[idx] = 2.0f * prev_acc[idx] - next_acc[idx] +\n", + " value * vel_acc[idx];\n", + " // End of device code\n", + " });\n", + " });\n", + "\n", + " // Swap the buffers for always having current values in prev buffer\n", + " std::swap(next_buf, prev_buf);\n", + " }\n", + "}\n", + "\n", + "int main(int argc, char* argv[]) {\n", + " // Arrays used to update the wavefield\n", + " float* prev;\n", + " float* next;\n", + " // Array to store wave velocity\n", + " float* vel;\n", + "\n", + " // Variables to store size of grids and number of simulation iterations\n", + " size_t n1, n2, n3;\n", + " size_t num_iterations;\n", + "\n", + " // Flag to verify results with CPU version\n", + " bool verify = false;\n", + "\n", + " if (argc < 5) {\n", + " Usage(argv[0]);\n", + " 
return 1;\n", + " }\n", + "\n", + " try {\n", + " // Parse command line arguments and increase them by HALO\n", + " n1 = std::stoi(argv[1]) + (2 * kHalfLength);\n", + " n2 = std::stoi(argv[2]) + (2 * kHalfLength);\n", + " n3 = std::stoi(argv[3]) + (2 * kHalfLength);\n", + " num_iterations = std::stoi(argv[4]);\n", + " if (argc > 5) verify = true;\n", + " } catch (...) {\n", + " Usage(argv[0]);\n", + " return 1;\n", + " }\n", + "\n", + " // Validate input sizes for the grid\n", + " if (ValidateInput(n1, n2, n3, num_iterations)) {\n", + " Usage(argv[0]);\n", + " return 1;\n", + " }\n", + "\n", + " // Create queue and print target info with default selector and in order\n", + " // property\n", + " queue q(default_selector_v, {property::queue::in_order()});\n", + " std::cout << \" Running linear indexed GPU version\\n\";\n", + " printTargetInfo(q);\n", + "\n", + " // Compute the total size of grid\n", + " size_t nsize = n1 * n2 * n3;\n", + "\n", + " prev = new float[nsize];\n", + " next = new float[nsize];\n", + " vel = new float[nsize];\n", + "\n", + " // Compute coefficients to be used in wavefield update\n", + " float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1,\n", + " +7.572087e-2, -1.76767677e-2, +3.480962e-3,\n", + " -5.180005e-4, +5.074287e-5, -2.42812e-6};\n", + "\n", + " // Apply the DX, DY and DZ to coefficients\n", + " coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz);\n", + " for (auto i = 1; i <= kHalfLength; i++) {\n", + " coeff[i] = coeff[i] / (dxyz * dxyz);\n", + " }\n", + "\n", + " // Initialize arrays and introduce initial conditions (source)\n", + " initialize(prev, next, vel, n1, n2, n3);\n", + "\n", + " auto start = std::chrono::steady_clock::now();\n", + "\n", + " // Invoke the driver function to perform 3D wave propagation offloaded to\n", + " // the device\n", + " iso3dfd(q, next, prev, vel, coeff, n1, n2, n3, num_iterations);\n", + "\n", + " auto end = std::chrono::steady_clock::now();\n", + " auto time = 
std::chrono::duration_cast<std::chrono::milliseconds>(end - start)\n", + " .count();\n", + " printStats(time, n1, n2, n3, num_iterations);\n", + "\n", + " // Verify result with the CPU serial version\n", + " if (verify) {\n", + " VerifyResult(prev, next, vel, coeff, n1, n2, n3, num_iterations);\n", + " }\n", + "\n", + " delete[] prev;\n", + " delete[] next;\n", + " delete[] vel;\n", + "\n", + " return 0;\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once the application is built, we can run it from the command line with a few parameters, as follows:\n", + "src/3_GPU_linear 1024 1024 1024 100\n", + "
- bin/3_GPU_linear is the binary\n", + "- 1024 1024 1024 are the sizes of the 3 dimensions; increasing them results in more computation time\n", + "- 100 is the number of time steps
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Build and Run\n", + "Select the cell below and click run ▶ to compile and execute the code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "! chmod 755 q; chmod 755 run_gpu_linear.sh;if [ -x \"$(command -v qsub)\" ]; then ./q run_gpu_linear.sh; else ./run_gpu_linear.sh; fi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## ISO3DFD GPU Optimizations" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "* We started from a code version running with standard C++ on the CPU.\n", + "* Using Intel® Offload Advisor, we determined which loop was a good candidate for offload and then using SYCL we worked on a solution to make our code run on the GPU but also on the CPU.\n", + "* We identifed the application is bound by Integer opearations.\n", + "* And finally we fixed the indexing in the current module to make the code more optimized.\n", + "* The next step, is to to run the Roofline Model and VTune to\n", + " * Check the current optimizations to see if we fixed the application being compute and INT bound\n", + " * And look for oppurtunites to optimize further on the GPU to understand if we still have obvious bottlenecks." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "### Running the GPU Roofline Analysis\n", + "With the offload implemented in 3_GPU_linear using SYCL, we'll want to run roofline analysis to see the improvements we made to the application and look for more areas where there is room for performance optimization.\n", + "```\n", + "advisor --collect=roofline --profile-gpu --project-dir=./advi_results -- ./myApplication \n", + "```\n", + "The iso3DFD GPU Linear code can be run using\n", + "```\n", + "advisor --collect=roofline --profile-gpu --project-dir=./../advisor/3_gpu -- ./build/src/3_GPU_linear 1024 1024 1024 100\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Build and Run\n", + "Select the cell below and click run ▶ to compile and execute the code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "! chmod 755 q; chmod 755 run_gpu_roofline_advisor_usm.sh;if [ -x \"$(command -v qsub)\" ]; then ./q run_gpu_roofline_advisor_usm.sh; else ./run_gpu_roofline_advisor_usm.sh; fi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Analyzing the output" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From the roofline analysis of the 3_GPU_linear.cpp version, we can see that the performance is close to predicted. \n", + "As noted in the below roofline model we can observe that,\n", + "\n", + "* The Improvements we see are :\n", + " * GINTOPS is 3X lower now compared to the previous version of the GPU code without linear indexing optimizations. 
Similary we got more GFLOPS\n", + " * Lesser Data transfer time\n", + " * Higher bandwidth usage\n", + "* Bottlenecks we see are:\n", + " * The application is now bounded by memory, specifically by the L3 bandwidth.\n", + "\n", + "\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "### Roofline Analysis report overview\n", + "To display the report, just execute the following frame. In practice, the report will be available in the folder you defined as --out-dir in the previous script. \n", + "\n", + "[View the report in HTML](reports/advisor_report_linear.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import IFrame\n", + "display(IFrame(src='reports/advisor_report_linear.html', width=1024, height=768))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Generating VTune reports\n", + "Below exercises we use VTune™ analyzer as a way to see what is going on with each implementation. The information was the high-level hotspot generated from the collection and rendered in an HTML iframe. Depending on the options chosen, many of the VTune analyzer's performance collections can be rendered via HTML pages. 
The VTune scripts below collect GPU offload and GPU hotspots information.\n", + "\n", + "#### Learn more about VTune\n", + "\n", + "There is extensive training available for VTune; click [here](https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/vtune-profiler.html#gs.2xmez3) for deep-dive training.\n", + "\n", + "```\n", + "vtune -run-pass-thru=--no-altstack -collect=gpu-offload -result-dir=vtune_dir -- ./build/src/3_GPU_linear 1024 1024 1024 100\n", + "```\n", + "\n", + "```\n", + "vtune -run-pass-thru=--no-altstack -collect=gpu-hotspots -result-dir=vtune_dir_hotspots -- ./build/src/3_GPU_linear 1024 1024 1024 100\n", + "```\n", + "\n", + "```\n", + "vtune -report summary -result-dir vtune_dir -format html -report-output ./reports/output_offload.html\n", + "```\n", + "\n", + "```\n", + "vtune -report summary -result-dir vtune_dir_hotspots -format html -report-output ./reports/output_hotspots.html\n", + "```\n", + "\n", + "[View the report in HTML](reports/output_offload_linear.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import IFrame\n", + "display(IFrame(src='reports/output_offload_linear.html', width=1024, height=768))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[View the report in HTML](reports/output_hotspots_linear.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import IFrame\n", + "display(IFrame(src='reports/output_hotspots_linear.html', width=1024, height=768))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Build and Run\n", + "Select the cell below and click run ▶ to compile and execute the code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "! 
chmod 755 q; chmod 755 run_gpu_linear_vtune.sh;if [ -x \"$(command -v qsub)\" ]; then ./q run_gpu_linear_vtune.sh; else ./run_gpu_linear_vtune.sh; fi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary\n", + "### Next Iteration of Implementing GPU Optimizations\n", + "We ran the roofline model and observed:\n", + "\n", + "* With the code changes that are in the 3_GPU_linear.cpp file, we can see in the roofline model that the INT operations decreased significantly\n", + "* The kernel now has much lower arithmetic intensity and increased bandwidth\n", + "* But we can now see that the application is bound by memory, i.e., the L3 bandwidth\n", + "* In the next iteration, we'll address the kernels being memory bound by increasing L1 cache reuse." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "state": {}, + "version_major": 2, + "version_minor": 0 + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/q b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/q new file mode 100644 index 0000000000..9bbad910d7 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/q @@ -0,0 +1,49 @@ +#!/bin/bash +#========================================== +# Copyright © Intel Corporation +# +# SPDX-License-Identifier: MIT +#========================================== +# Script 
to submit job in Intel(R) DevCloud +# Version: 0.71 +#========================================== +if [ -z "$1" ]; then + echo "Missing script argument, Usage: ./q run.sh" +elif [ ! -f "$1" ]; then + echo "File $1 does not exist" +else + echo "Job has been submitted to Intel(R) DevCloud and will execute soon." + echo "" + script=$1 + # Remove old output files + rm *.sh.* > /dev/null 2>&1 + # Submit job using qsub + qsub_id=`qsub -l nodes=1:gpu:ppn=2 -d . $script` + job_id="$(cut -d'.' -f1 <<<"$qsub_id")" + # Print qstat output + qstat + # Wait for output file to be generated and display + echo "" + echo -ne "Waiting for Output " + until [ -f $script.o$job_id ]; do + sleep 1 + echo -ne "█" + ((timeout++)) + # Timeout if no output file generated within 60 seconds + if [ $timeout == 70 ]; then + echo "" + echo "" + echo "TimeOut 60 seconds: Job is still queued for execution, check for output file later ($script.o$job_id)" + echo "" + break + fi + done + # Print output and error file content if exist + if [ -n "$(find -name '*.sh.o'$job_id)" ]; then + echo " Done⬇" + cat $script.o$job_id + cat $script.e$job_id + echo "Job Completed in $timeout seconds." + rm *.sh.*$job_id > /dev/null 2>&1 + fi +fi diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/reports/advisor_report_linear.html b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/reports/advisor_report_linear.html new file mode 100644 index 0000000000..73e0ffdee4 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/reports/advisor_report_linear.html @@ -0,0 +1,2 @@ +Intel Advisor Report
\ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/reports/output_hotspots_linear.html b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/reports/output_hotspots_linear.html new file mode 100644 index 0000000000..646676c71a --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/reports/output_hotspots_linear.html @@ -0,0 +1,144 @@ + +
Intel® VTune Profiler 2024.0.0
  • Elapsed Time: + 17.718s
    • GPU Time: + 6.713s
  • Display controller: Intel Corporation Device 0x0bda Device Group: +
    • XVE Array Stalled/Idle: + 85.9% of Elapsed time with GPU busy
      The percentage of time when the XVEs were stalled or idle is high, which has a negative impact on compute-bound applications.
      • This section shows the XVE metrics per stack and per adapter for all the devices in this group.: +
        GPU StackGPU AdapterXVE Array Active(%)XVE Array Stalled(%)XVE Array Idle(%)
        0GPU 10.0%0.0%100.0%
        0GPU 30.0%0.0%100.0%
        0GPU 021.4%16.3%62.3%
        0GPU 20.0%0.0%100.0%
    • GPU L3 Bandwidth Bound: + 5.6% of peak value
    • Occupancy: + 21.3% of peak value
      Several factors including shared local memory, use of memory barriers, and inefficient work scheduling can cause a low value of the occupancy metric.
      • This section shows the computing tasks with low occupancy metric for all the devices in this group.: +
        Computing TaskTotal TimeOccupancy(%)SIMD Utilization(%)
        iso3dfd(sycl::_V1::queue&, float*, float*, float*, float*, unsigned long, unsigned long, unsigned long, unsigned long)::{lambda(sycl::_V1::handler&)#1}::operator()(sycl::_V1::handler&) const::{lambda(sycl::_V1::id<(int)3>)#1}6.713s21.3% of peak value25.0%
  • Collection and Platform Info: +
    • Application Command Line: + ./build/src/3_GPU_linear "1024" "1024" "1024" "100"
    • Operating System: + 5.15.0-100-generic DISTRIB_ID=Ubuntu +DISTRIB_RELEASE=22.04 +DISTRIB_CODENAME=jammy +DISTRIB_DESCRIPTION="Ubuntu 22.04.4 LTS"
    • Computer Name: + idc-beta-batch-pvc-node-06
    • Result Size: + 116.1 MB
    • Collection start time: + 21:33:53 18/03/2024 UTC
    • Collection stop time: + 21:34:11 18/03/2024 UTC
    • Collector Type: + Event-based sampling driver,User-mode sampling and tracing
    • CPU: +
      • Name: + Intel(R) Xeon(R) Processor code named Sapphirerapids
      • Frequency: + 2.000 GHz
      • Logical CPU Count: + 224
      • LLC size: + 110.1 MB
    • GPU: +
      • GPU 0: 0:41:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:41:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
      • GPU 1: 0:58:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:58:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
      • GPU 2: 0:154:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:154:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
      • GPU 3: 0:202:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:202:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
+ diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/reports/output_offload_linear.html b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/reports/output_offload_linear.html new file mode 100644 index 0000000000..1e03896d48 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/reports/output_offload_linear.html @@ -0,0 +1,142 @@ + +
Intel® VTune Profiler 2024.0.0

Recommendations:

GPU Time, % of Elapsed time: 37.1%
GPU utilization is low. Switch to the Hotspots analysis for in-depth analysis of host activity. Poor GPU utilization can prevent the application from offloading effectively.
XVE Array Stalled/Idle: 43.5% of Elapsed time with GPU busy
GPU metrics detect some kernel issues. Use GPU Compute/Media Hotspots (preview) to understand how well your application runs on the specified hardware.
  • Elapsed Time: + 18.085s
    • GPU Time, % of Elapsed time: + 37.1%
      GPU utilization is low. Consider offloading more work to the GPU to increase overall application performance.
      • GPU Time, % of Elapsed time: +
        GPU AdapterGPU EngineGPU TimeGPU Time, % of Elapsed time(%)
        GPU 0Render and GPGPU6.713s37.1%
      • Top Hotspots when GPU was idle: +
        FunctionModuleCPU Time
        func@0x13f9b0libze_intel_gpu.so.1.3.27191.4219.011s
        asm_exc_page_faultvmlinux4.960s
        _raw_spin_lockvmlinux2.108s
        [Skipped stack frame(s)][Unknown]1.684s
        asm_exc_int3vmlinux1.672s
        [Others]N/A27.632s
  • Hottest Host Tasks: +
    Host TaskTask Time% of Elapsed Time(%)Task Count
    zeEventHostSynchronize7.143s39.5%14
    zeCommandListAppendMemoryCopy1.687s9.3%6
    zeModuleCreate0.135s0.7%1
    zeCommandListAppendLaunchKernel0.002s0.0%100
    zeCommandListCreateImmediate0.001s0.0%3
    [Others]0.000s0.0%5
  • Hottest GPU Computing Tasks: +
    Computing TaskTotal TimeExecution Time% of Total Time(%)SIMD Width
    iso3dfd(sycl::_V1::queue&, float*, float*, float*, float*, unsigned long, unsigned long, unsigned long, unsigned long)::{lambda(sycl::_V1::handler&)#1}::operator()(sycl::_V1::handler&) const::{lambda(sycl::_V1::id<(int)3>)#1}8.840s6.713s75.9%32
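The percentages in the tables above can be cross-checked against the report's own 18.085 s elapsed time; a quick sanity check using only values taken from this report:

```python
# Cross-check the VTune report's percentages against its elapsed time.
elapsed = 18.085        # Elapsed Time (s)
gpu_time = 6.713        # GPU Time (s)
sync_time = 7.143       # zeEventHostSynchronize host task time (s)
kernel_total = 8.840    # iso3dfd computing task total time (s)

print(round(gpu_time / elapsed * 100, 1))        # -> 37.1 (GPU Time, % of Elapsed time)
print(round(sync_time / elapsed * 100, 1))       # -> 39.5 (zeEventHostSynchronize share)
print(round(gpu_time / kernel_total * 100, 1))   # -> 75.9 (kernel execution % of total)
```

The numbers agree with the report: roughly 40% of elapsed time is the host blocking in `zeEventHostSynchronize`, which matches the low 37.1% GPU utilization flagged in the recommendations.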
  • Collection and Platform Info: +
    • Application Command Line: + ./build/src/3_GPU_linear "1024" "1024" "1024" "100"
    • Operating System: + 5.15.0-100-generic DISTRIB_ID=Ubuntu +DISTRIB_RELEASE=22.04 +DISTRIB_CODENAME=jammy +DISTRIB_DESCRIPTION="Ubuntu 22.04.4 LTS"
    • Computer Name: + idc-beta-batch-pvc-node-06
    • Result Size: + 173.7 MB
    • Collection start time: + 21:33:01 18/03/2024 UTC
    • Collection stop time: + 21:33:19 18/03/2024 UTC
    • Collector Type: + Event-based sampling driver, Driverless Perf system-wide sampling, User-mode sampling and tracing
    • CPU: +
      • Name: + Intel(R) Xeon(R) Processor code named Sapphire Rapids
      • Frequency: + 2.000 GHz
      • Logical CPU Count: + 224
      • LLC size: + 110.1 MB
    • GPU: +
      • GPU 0: 0:41:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:41:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
      • GPU 1: 0:58:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:58:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
      • GPU 2: 0:154:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:154:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
      • GPU 3: 0:202:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:202:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
+ diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/run_gpu_linear.sh b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/run_gpu_linear.sh new file mode 100644 index 0000000000..69db2bed4d --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/run_gpu_linear.sh @@ -0,0 +1,8 @@ +#!/bin/bash + +rm -rf build +build="$PWD/build" +[ ! -d "$build" ] && mkdir -p "$build" +cd build && +cmake .. && +make run_gpu_linear diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/run_gpu_linear_roofline.sh b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/run_gpu_linear_roofline.sh new file mode 100644 index 0000000000..b544b77010 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/run_gpu_linear_roofline.sh @@ -0,0 +1,10 @@ +#!/bin/bash +#advisor --collect=roofline --profile-gpu --project-dir=./../advisor/3_gpu -- ./build/src/3_GPU_linear 256 256 256 100 + +advisor --collect=survey --profile-gpu -project-dir=./roofline_linear -- ./build/src/3_GPU_linear 256 256 256 100 +advisor --collect=tripcounts --profile-gpu --project-dir=./roofline_linear -- ./build/src/3_GPU_linear 256 256 256 100 +advisor --collect=projection --profile-gpu --model-baseline-gpu --project-dir=./roofline_linear +advisor --report=roofline --gpu --project-dir=roofline_linear --report-output=./roofline_linear/roofline_linear.html + + + diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/run_gpu_linear_usm.sh b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/run_gpu_linear_usm.sh new file mode 100644 index 0000000000..8d60256947 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/run_gpu_linear_usm.sh @@ -0,0 +1,8 @@ +#!/bin/bash + +rm -rf build +build="$PWD/build" +[ ! 
-d "$build" ] && mkdir -p "$build" +cd build && +cmake .. && +make run_gpu_linear_usm diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/run_gpu_linear_vtune.sh b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/run_gpu_linear_vtune.sh new file mode 100644 index 0000000000..37c1230ba2 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/run_gpu_linear_vtune.sh @@ -0,0 +1,6 @@ +#!/bin/bash +vtune -run-pass-thru=--no-altstack -collect=gpu-offload -result-dir=vtune_dir_linear -- ./build/src/3_GPU_linear 1024 1024 1024 100 +vtune -run-pass-thru=--no-altstack -collect=gpu-hotspots -result-dir=vtune_dir_hotspots_linear -- ./build/src/3_GPU_linear 1024 1024 1024 100 +vtune -report summary -result-dir vtune_dir_linear -format html -report-output ./reports/output_offload_linear.html +vtune -report summary -result-dir vtune_dir_hotspots_linear -format html -report-output ./reports/output_hotspots_linear.html + diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/run_gpu_roofline_advisor.sh b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/run_gpu_roofline_advisor.sh new file mode 100644 index 0000000000..b8c7839f45 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/run_gpu_roofline_advisor.sh @@ -0,0 +1,4 @@ +#!/bin/bash +advisor --collect=roofline --profile-gpu --project-dir=./../advisor/3_gpu/ -- ./build/src/3_GPU_linear 1024 1024 1024 100 + + diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/run_gpu_roofline_advisor_usm.sh b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/run_gpu_roofline_advisor_usm.sh new file mode 100644 index 0000000000..f1eb79597f --- /dev/null +++ 
b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/run_gpu_roofline_advisor_usm.sh @@ -0,0 +1,4 @@ +#!/bin/bash +advisor --collect=roofline --profile-gpu --project-dir=./../advisor/3_gpu/usm -- ./build/src/3_GPU_linear_USM 1024 1024 1024 100 + + diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/1_CPU_only.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/1_CPU_only.cpp new file mode 100644 index 0000000000..97730a9aec --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/1_CPU_only.cpp @@ -0,0 +1,129 @@ +//============================================================== +// Copyright 2022 Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= + +#include +#include +#include + +#include "Utils.hpp" + +void inline iso3dfdIteration(float* ptr_next_base, float* ptr_prev_base, + float* ptr_vel_base, float* coeff, const size_t n1, + const size_t n2, const size_t n3) { + auto dimn1n2 = n1 * n2; + + // Remove HALO from the end + auto n3_end = n3 - kHalfLength; + auto n2_end = n2 - kHalfLength; + auto n1_end = n1 - kHalfLength; + + for (auto iz = kHalfLength; iz < n3_end; iz++) { + for (auto iy = kHalfLength; iy < n2_end; iy++) { + // Calculate start pointers for the row over X dimension + float* ptr_next = ptr_next_base + iz * dimn1n2 + iy * n1; + float* ptr_prev = ptr_prev_base + iz * dimn1n2 + iy * n1; + float* ptr_vel = ptr_vel_base + iz * dimn1n2 + iy * n1; + + // Iterate over X + for (auto ix = kHalfLength; ix < n1_end; ix++) { + // Calculate values for each cell + float value = ptr_prev[ix] * coeff[0]; + for (int i = 1; i <= kHalfLength; i++) { + value += + coeff[i] * + (ptr_prev[ix + i] + ptr_prev[ix - i] + + ptr_prev[ix + i * n1] + ptr_prev[ix - i * n1] + + ptr_prev[ix + i * dimn1n2] + ptr_prev[ix - i * dimn1n2]); + } + ptr_next[ix] = 2.0f * 
ptr_prev[ix] - ptr_next[ix] + value * ptr_vel[ix]; + } + } + } +} + +void iso3dfd(float* next, float* prev, float* vel, float* coeff, + const size_t n1, const size_t n2, const size_t n3, + const size_t nreps) { + for (auto it = 0; it < nreps; it++) { + iso3dfdIteration(next, prev, vel, coeff, n1, n2, n3); + // Swap the pointers for always having current values in prev array + std::swap(next, prev); + } +} + +int main(int argc, char* argv[]) { + // Arrays used to update the wavefield + float* prev; + float* next; + // Array to store wave velocity + float* vel; + + // Variables to store size of grids and number of simulation iterations + size_t n1, n2, n3; + size_t num_iterations; + + if (argc < 5) { + Usage(argv[0]); + return 1; + } + + try { + // Parse command line arguments and increase them by HALO + n1 = std::stoi(argv[1]) + (2 * kHalfLength); + n2 = std::stoi(argv[2]) + (2 * kHalfLength); + n3 = std::stoi(argv[3]) + (2 * kHalfLength); + num_iterations = std::stoi(argv[4]); + } catch (...) 
{ + Usage(argv[0]); + return 1; + } + + // Validate input sizes for the grid + if (ValidateInput(n1, n2, n3, num_iterations)) { + Usage(argv[0]); + return 1; + } + + // Compute the total size of grid + size_t nsize = n1 * n2 * n3; + + prev = new float[nsize]; + next = new float[nsize]; + vel = new float[nsize]; + + // Compute coefficients to be used in wavefield update + float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1, + +7.572087e-2, -1.76767677e-2, +3.480962e-3, + -5.180005e-4, +5.074287e-5, -2.42812e-6}; + + // Apply the DX, DY and DZ to coefficients + coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz); + for (auto i = 1; i <= kHalfLength; i++) { + coeff[i] = coeff[i] / (dxyz * dxyz); + } + + // Initialize arrays and introduce initial conditions (source) + initialize(prev, next, vel, n1, n2, n3); + + std::cout << "Running on CPU serial version\n"; + auto start = std::chrono::steady_clock::now(); + + // Invoke the driver function to perform 3D wave propagation 1 thread serial + // version + iso3dfd(next, prev, vel, coeff, n1, n2, n3, num_iterations); + + auto end = std::chrono::steady_clock::now(); + auto time = std::chrono::duration_cast(end - start) + .count(); + + printStats(time, n1, n2, n3, num_iterations); + + delete[] prev; + delete[] next; + delete[] vel; + + return 0; +} \ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/2_GPU_basic.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/2_GPU_basic.cpp new file mode 100644 index 0000000000..3571f98bfc --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/2_GPU_basic.cpp @@ -0,0 +1,153 @@ +//============================================================== +// Copyright 2022 Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= + +#include +#include +#include +#include + 
+#include "Utils.hpp" + +using namespace sycl; + +void iso3dfd(queue& q, float* next, float* prev, float* vel, float* coeff, + const size_t n1, const size_t n2, const size_t n3, + const size_t nreps) { + // Create 3D SYCL range for kernels which not include HALO + range<3> kernel_range(n1 - 2 * kHalfLength, n2 - 2 * kHalfLength, + n3 - 2 * kHalfLength); + // Create 3D SYCL range for buffers which include HALO + range<3> buffer_range(n1, n2, n3); + // Create buffers using SYCL class buffer + buffer next_buf(next, buffer_range); + buffer prev_buf(prev, buffer_range); + buffer vel_buf(vel, buffer_range); + buffer coeff_buf(coeff, range(kHalfLength + 1)); + + for (auto it = 0; it < nreps; it += 1) { + // Submit command group for execution + q.submit([&](handler& h) { + // Create accessors + accessor next_acc(next_buf, h); + accessor prev_acc(prev_buf, h); + accessor vel_acc(vel_buf, h, read_only); + accessor coeff_acc(coeff_buf, h, read_only); + + // Send a SYCL kernel(lambda) to the device for parallel execution + // Each kernel runs single cell + h.parallel_for(kernel_range, [=](id<3> idx) { + // Start of device code + // Add offsets to indices to exclude HALO + int i = idx[0] + kHalfLength; + int j = idx[1] + kHalfLength; + int k = idx[2] + kHalfLength; + + // Calculate values for each cell + float value = prev_acc[i][j][k] * coeff_acc[0]; +#pragma unroll(8) + for (int x = 1; x <= kHalfLength; x++) { + value += + coeff_acc[x] * (prev_acc[i][j][k + x] + prev_acc[i][j][k - x] + + prev_acc[i][j + x][k] + prev_acc[i][j - x][k] + + prev_acc[i + x][j][k] + prev_acc[i - x][j][k]); + } + next_acc[i][j][k] = 2.0f * prev_acc[i][j][k] - next_acc[i][j][k] + + value * vel_acc[i][j][k]; + // End of device code + }); + }); + + // Swap the buffers for always having current values in prev buffer + std::swap(next_buf, prev_buf); + } +} + +int main(int argc, char* argv[]) { + // Arrays used to update the wavefield + float* prev; + float* next; + // Array to store wave velocity + 
float* vel; + + // Variables to store size of grids and number of simulation iterations + size_t n1, n2, n3; + size_t num_iterations; + + // Flag to verify results with CPU version + bool verify = false; + + if (argc < 5) { + Usage(argv[0]); + return 1; + } + + try { + // Parse command line arguments and increase them by HALO + n1 = std::stoi(argv[1]) + (2 * kHalfLength); + n2 = std::stoi(argv[2]) + (2 * kHalfLength); + n3 = std::stoi(argv[3]) + (2 * kHalfLength); + num_iterations = std::stoi(argv[4]); + if (argc > 5) verify = true; + } catch (...) { + Usage(argv[0]); + return 1; + } + + // Validate input sizes for the grid + if (ValidateInput(n1, n2, n3, num_iterations)) { + Usage(argv[0]); + return 1; + } + + // Create queue and print target info with default selector and in order + // property + queue q(default_selector_v, {property::queue::in_order()}); + std::cout << " Running GPU basic offload version\n"; + printTargetInfo(q); + + // Compute the total size of grid + size_t nsize = n1 * n2 * n3; + + prev = new float[nsize]; + next = new float[nsize]; + vel = new float[nsize]; + + // Compute coefficients to be used in wavefield update + float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1, + +7.572087e-2, -1.76767677e-2, +3.480962e-3, + -5.180005e-4, +5.074287e-5, -2.42812e-6}; + + // Apply the DX, DY and DZ to coefficients + coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz); + for (auto i = 1; i <= kHalfLength; i++) { + coeff[i] = coeff[i] / (dxyz * dxyz); + } + + // Initialize arrays and introduce initial conditions (source) + initialize(prev, next, vel, n1, n2, n3); + + auto start = std::chrono::steady_clock::now(); + + // Invoke the driver function to perform 3D wave propagation offloaded to + // the device + iso3dfd(q, next, prev, vel, coeff, n1, n2, n3, num_iterations); + + auto end = std::chrono::steady_clock::now(); + auto time = std::chrono::duration_cast(end - start) + .count(); + printStats(time, n1, n2, n3, num_iterations); + + // Verify 
result with the CPU serial version + if (verify) { + VerifyResult(prev, next, vel, coeff, n1, n2, n3, num_iterations); + } + + delete[] prev; + delete[] next; + delete[] vel; + + return 0; +} \ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/3_GPU_linear.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/3_GPU_linear.cpp new file mode 100644 index 0000000000..cff88014de --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/3_GPU_linear.cpp @@ -0,0 +1,157 @@ +//============================================================== +// Copyright © Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= +#include +#include +#include +#include + +#include "Utils.hpp" + +using namespace sycl; + +void iso3dfd(queue& q, float* next, float* prev, float* vel, float* coeff, + const size_t n1, const size_t n2, const size_t n3, + const size_t nreps) { + // Create 3D SYCL range for kernels which not include HALO + range<3> kernel_range(n1 - 2 * kHalfLength, n2 - 2 * kHalfLength, + n3 - 2 * kHalfLength); + + // Create 1D SYCL range for buffers which include HALO + range<1> buffer_range(n1 * n2 * n3); + // Create buffers using SYCL class buffer + buffer next_buf(next, buffer_range); + buffer prev_buf(prev, buffer_range); + buffer vel_buf(vel, buffer_range); + buffer coeff_buf(coeff, range(kHalfLength + 1)); + + for (auto it = 0; it < nreps; it++) { + // Submit command group for execution + q.submit([&](handler& h) { + // Create accessors + accessor next_acc(next_buf, h); + accessor prev_acc(prev_buf, h); + accessor vel_acc(vel_buf, h, read_only); + accessor coeff_acc(coeff_buf, h, read_only); + + // Send a SYCL kernel(lambda) to the device for parallel execution + // Each kernel runs single cell + h.parallel_for(kernel_range, [=](id<3> nidx) { + // Start of device code 
+ // Add offsets to indices to exclude HALO + int n2n3 = n2 * n3; + int i = nidx[0] + kHalfLength; + int j = nidx[1] + kHalfLength; + int k = nidx[2] + kHalfLength; + + // Calculate linear index for each cell + int idx = i * n2n3 + j * n3 + k; + + // Calculate values for each cell + float value = prev_acc[idx] * coeff_acc[0]; +#pragma unroll(8) + for (int x = 1; x <= kHalfLength; x++) { + value += + coeff_acc[x] * (prev_acc[idx + x] + prev_acc[idx - x] + + prev_acc[idx + x * n3] + prev_acc[idx - x * n3] + + prev_acc[idx + x * n2n3] + prev_acc[idx - x * n2n3]); + } + next_acc[idx] = 2.0f * prev_acc[idx] - next_acc[idx] + + value * vel_acc[idx]; + // End of device code + }); + }); + + // Swap the buffers for always having current values in prev buffer + std::swap(next_buf, prev_buf); + } +} + +int main(int argc, char* argv[]) { + // Arrays used to update the wavefield + float* prev; + float* next; + // Array to store wave velocity + float* vel; + + // Variables to store size of grids and number of simulation iterations + size_t n1, n2, n3; + size_t num_iterations; + + // Flag to verify results with CPU version + bool verify = false; + + if (argc < 5) { + Usage(argv[0]); + return 1; + } + + try { + // Parse command line arguments and increase them by HALO + n1 = std::stoi(argv[1]) + (2 * kHalfLength); + n2 = std::stoi(argv[2]) + (2 * kHalfLength); + n3 = std::stoi(argv[3]) + (2 * kHalfLength); + num_iterations = std::stoi(argv[4]); + if (argc > 5) verify = true; + } catch (...) 
{ + Usage(argv[0]); + return 1; + } + + // Validate input sizes for the grid + if (ValidateInput(n1, n2, n3, num_iterations)) { + Usage(argv[0]); + return 1; + } + + // Create queue and print target info with default selector and in order + // property + queue q(default_selector_v, {property::queue::in_order()}); + std::cout << " Running linear indexed GPU version\n"; + printTargetInfo(q); + + // Compute the total size of grid + size_t nsize = n1 * n2 * n3; + + prev = new float[nsize]; + next = new float[nsize]; + vel = new float[nsize]; + + // Compute coefficients to be used in wavefield update + float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1, + +7.572087e-2, -1.76767677e-2, +3.480962e-3, + -5.180005e-4, +5.074287e-5, -2.42812e-6}; + + // Apply the DX, DY and DZ to coefficients + coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz); + for (auto i = 1; i <= kHalfLength; i++) { + coeff[i] = coeff[i] / (dxyz * dxyz); + } + + // Initialize arrays and introduce initial conditions (source) + initialize(prev, next, vel, n1, n2, n3); + + auto start = std::chrono::steady_clock::now(); + + // Invoke the driver function to perform 3D wave propagation offloaded to + // the device + iso3dfd(q, next, prev, vel, coeff, n1, n2, n3, num_iterations); + + auto end = std::chrono::steady_clock::now(); + auto time = std::chrono::duration_cast(end - start) + .count(); + printStats(time, n1, n2, n3, num_iterations); + + // Verify result with the CPU serial version + //if (verify) { + //VerifyResult(prev, next, vel, coeff, n1, n2, n3, num_iterations); + //} + + delete[] prev; + delete[] next; + delete[] vel; + + return 0; +} diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/3_GPU_linear_USM.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/3_GPU_linear_USM.cpp new file mode 100644 index 0000000000..572ffba269 --- /dev/null +++ 
b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/3_GPU_linear_USM.cpp @@ -0,0 +1,170 @@ +//============================================================== +// Copyright © Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= +#include +#include +#include +#include + +#include "Utils.hpp" + +using namespace sycl; + +bool iso3dfd(sycl::queue &q, float *ptr_next, float *ptr_prev, + float *ptr_vel, float *ptr_coeff, size_t n1, size_t n2, + size_t n3, unsigned int nIterations) { + auto nx = n1; + auto nxy = n1*n2; + auto grid_size = n1*n2*n3; + auto b1 = kHalfLength; + auto b2 = kHalfLength; + auto b3 = kHalfLength; + + // Create 3D SYCL range for kernels which not include HALO + range<3> kernel_range(n1 - 2 * kHalfLength, n2 - 2 * kHalfLength, + n3 - 2 * kHalfLength); + + auto next = sycl::aligned_alloc_device(64, grid_size + 16, q); + next += (16 - b1); + q.memcpy(next, ptr_next, sizeof(float)*grid_size); + auto prev = sycl::aligned_alloc_device(64, grid_size + 16, q); + prev += (16 - b1); + q.memcpy(prev, ptr_prev, sizeof(float)*grid_size); + auto vel = sycl::aligned_alloc_device(64, grid_size + 16, q); + vel += (16 - b1); + q.memcpy(vel, ptr_vel, sizeof(float)*grid_size); + auto coeff = sycl::aligned_alloc_device(64, kHalfLength + 1, q); + //coeff += (16 - b1); + q.memcpy(coeff, ptr_coeff, sizeof(float)*(kHalfLength+1)); + q.wait(); + + for (auto it = 0; it < nIterations; it += 1) { + // Submit command group for execution + q.submit([&](handler& h) { + // Send a SYCL kernel(lambda) to the device for parallel execution + // Each kernel runs single cell + h.parallel_for(kernel_range, [=](id<3> idx) { + // Start of device code + // Add offsets to indices to exclude HALO + int n2n3 = n2 * n3; + int i = idx[0] + kHalfLength; + int j = idx[1] + kHalfLength; + int k = idx[2] + kHalfLength; + + // Calculate linear index for each cell + int gid = i * n2n3 + j * n3 + k; + auto 
value = coeff[0] * prev[gid]; + + // Calculate values for each cell +#pragma unroll(8) + for (int x = 1; x <= kHalfLength; x++) { + value += coeff[x] * (prev[gid + x] + prev[gid - x] + + prev[gid + x * n3] + prev[gid - x * n3] + + prev[gid + x * n2n3] + prev[gid - x * n2n3]); + } + next[gid] = 2.0f * prev[gid] - next[gid] + value * vel[gid]; + + // End of device code + }); + }).wait(); + + // Swap the buffers for always having current values in prev buffer + std::swap(next, prev); + } + q.memcpy(ptr_prev, prev, sizeof(float)*grid_size); + + sycl::free(next - (16 - b1),q); + sycl::free(prev - (16 - b1),q); + sycl::free(vel - (16 - b1),q); + sycl::free(coeff,q); + return true; +} + +int main(int argc, char* argv[]) { + // Arrays used to update the wavefield + float* prev; + float* next; + // Array to store wave velocity + float* vel; + + // Variables to store size of grids and number of simulation iterations + size_t n1, n2, n3; + size_t num_iterations; + + // Flag to verify results with CPU version + bool verify = false; + + if (argc < 5) { + Usage(argv[0]); + return 1; + } + + try { + // Parse command line arguments and increase them by HALO + n1 = std::stoi(argv[1]) + (2 * kHalfLength); + n2 = std::stoi(argv[2]) + (2 * kHalfLength); + n3 = std::stoi(argv[3]) + (2 * kHalfLength); + num_iterations = std::stoi(argv[4]); + if (argc > 5) verify = true; + } catch (...) 
{ + Usage(argv[0]); + return 1; + } + + // Validate input sizes for the grid + if (ValidateInput(n1, n2, n3, num_iterations)) { + Usage(argv[0]); + return 1; + } + + // Create queue and print target info with default selector and in order + // property + queue q(default_selector_v, {property::queue::in_order()}); + std::cout << " Running linear indexed GPU version\n"; + printTargetInfo(q); + + // Compute the total size of grid + size_t nsize = n1 * n2 * n3; + + prev = new float[nsize]; + next = new float[nsize]; + vel = new float[nsize]; + + // Compute coefficients to be used in wavefield update + float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1, + +7.572087e-2, -1.76767677e-2, +3.480962e-3, + -5.180005e-4, +5.074287e-5, -2.42812e-6}; + + // Apply the DX, DY and DZ to coefficients + coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz); + for (auto i = 1; i <= kHalfLength; i++) { + coeff[i] = coeff[i] / (dxyz * dxyz); + } + + // Initialize arrays and introduce initial conditions (source) + initialize(prev, next, vel, n1, n2, n3); + + auto start = std::chrono::steady_clock::now(); + + // Invoke the driver function to perform 3D wave propagation offloaded to + // the device + iso3dfd(q, next, prev, vel, coeff, n1, n2, n3, num_iterations); + + auto end = std::chrono::steady_clock::now(); + auto time = std::chrono::duration_cast(end - start) + .count(); + printStats(time, n1, n2, n3, num_iterations); + + // Verify result with the CPU serial version + if (verify) { + VerifyResult(prev, next, vel, coeff, n1, n2, n3, num_iterations); + } + + delete[] prev; + delete[] next; + delete[] vel; + + return 0; +} diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/4_GPU_optimized.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/4_GPU_optimized.cpp new file mode 100644 index 0000000000..99dd9d85b8 --- /dev/null +++ 
b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/4_GPU_optimized.cpp @@ -0,0 +1,171 @@ +//============================================================== +// Copyright © Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= +#include +#include +#include +#include + +#include "Utils.hpp" + +using namespace sycl; + +void iso3dfd(queue& q, float* ptr_next, float* ptr_prev, float* ptr_vel, float* ptr_coeff, + const size_t n1, const size_t n2, const size_t n3,size_t n1_block, size_t n2_block, size_t n3_block, + const size_t nIterations) { + auto nx = n1; + auto nxy = n1*n2; + auto grid_size = nxy*n3; + + auto b1 = kHalfLength; + auto b2 = kHalfLength; + auto b3 = kHalfLength; + + auto next = sycl::aligned_alloc_device(64, grid_size + 16, q); + next += (16 - b1); + q.memcpy(next, ptr_next, sizeof(float)*grid_size); + auto prev = sycl::aligned_alloc_device(64, grid_size + 16, q); + prev += (16 - b1); + q.memcpy(prev, ptr_prev, sizeof(float)*grid_size); + auto vel = sycl::aligned_alloc_device(64, grid_size + 16, q); + vel += (16 - b1); + q.memcpy(vel, ptr_vel, sizeof(float)*grid_size); + //auto coeff = sycl::aligned_alloc_device(64, grid_size + 16, q); + auto coeff = sycl::aligned_alloc_device(64, kHalfLength+1 , q); + q.memcpy(coeff, ptr_coeff, sizeof(float)*(kHalfLength+1)); + //coeff += (16 - b1); + //q.memcpy(coeff, coeff, sizeof(float)*grid_size); + q.wait(); + + //auto local_nd_range = range(1, n2_block, n1_block); + //auto global_nd_range = range((n3 - 2 * kHalfLength)/n3_block, (n2 - 2 * kHalfLength)/n2_block, + //(n1 - 2 * kHalfLength)); + + auto local_nd_range = range<3>(n3_block,n2_block,n1_block); + auto global_nd_range = range<3>((n3-2*b3+n3_block-1)/n3_block*n3_block,(n2-2*b2+n2_block-1)/n2_block*n2_block,n1_block); + + + for (auto i = 0; i < nIterations; i += 1) { + q.submit([&](auto &h) { + h.parallel_for( + nd_range(global_nd_range, local_nd_range), 
[=](auto item) + //[[intel::reqd_sub_group_size(32)]] + //[[intel::kernel_args_restrict]] + { + const int iz = b3 + item.get_global_id(0); + const int iy = b2 + item.get_global_id(1); + if (iz < n3 - b3 && iy < n2 - b2) + for (int ix = b1+item.get_global_id(2); ix < n1 - b1; ix += n1_block) + { + auto gid = ix + iy*nx + iz*nxy; + float *pgid = prev+gid; + auto value = coeff[0] * pgid[0]; +#pragma unroll(kHalfLength) + for (auto iter = 1; iter <= kHalfLength; iter++) + value += coeff[iter]*(pgid[iter*nxy] + pgid[-iter*nxy] + pgid[iter*nx] + pgid[-iter*nx] + pgid[iter] + pgid[-iter]); + next[gid] = 2.0f*pgid[0] - next[gid] + value*vel[gid]; + } + }); + }).wait(); + std::swap(next, prev); + } + q.memcpy(ptr_prev, prev, sizeof(float)*grid_size); + + sycl::free(next - (16 - b1),q); + sycl::free(prev - (16 - b1),q); + sycl::free(vel - (16 - b1),q); + sycl::free(coeff,q); + +} + +int main(int argc, char* argv[]) { + // Arrays used to update the wavefield + float* prev; + float* next; + // Array to store wave velocity + float* vel; + + // Variables to store size of grids and number of simulation iterations + size_t n1, n2, n3; + size_t n1_block, n2_block, n3_block; + size_t num_iterations; + + // Flag to verify results with CPU version + bool verify = false; + + if (argc < 5) { + Usage(argv[0]); + return 1; + } + + try { + // Parse command line arguments and increase them by HALO + n1 = std::stoi(argv[1]) + (2 * kHalfLength); + n2 = std::stoi(argv[2]) + (2 * kHalfLength); + n3 = std::stoi(argv[3]) + (2 * kHalfLength); + n1_block = std::stoi(argv[4]); + n2_block = std::stoi(argv[5]); + n3_block = std::stoi(argv[6]); + num_iterations = std::stoi(argv[7]); + } catch (...) 
{ + Usage(argv[0]); + return 1; + } + + // Validate input sizes for the grid + if (ValidateInput(n1, n2, n3, num_iterations)) { + Usage(argv[0]); + return 1; + } + + // Create queue and print target info with default selector and in order + // property + queue q(default_selector_v, {property::queue::in_order()}); + std::cout << " Running linear indexed GPU version\n"; + printTargetInfo(q); + + // Compute the total size of grid + size_t nsize = n1 * n2 * n3; + + prev = new float[nsize]; + next = new float[nsize]; + vel = new float[nsize]; + + // Compute coefficients to be used in wavefield update + float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1, + +7.572087e-2, -1.76767677e-2, +3.480962e-3, + -5.180005e-4, +5.074287e-5, -2.42812e-6}; + + // Apply the DX, DY and DZ to coefficients + coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz); + for (auto i = 1; i <= kHalfLength; i++) { + coeff[i] = coeff[i] / (dxyz * dxyz); + } + + // Initialize arrays and introduce initial conditions (source) + initialize(prev, next, vel, n1, n2, n3); + + auto start = std::chrono::steady_clock::now(); + + // Invoke the driver function to perform 3D wave propagation offloaded to + // the device + iso3dfd(q, next, prev, vel, coeff, n1, n2, n3,n1_block,n2_block,n3_block, num_iterations); + + auto end = std::chrono::steady_clock::now(); + auto time = std::chrono::duration_cast(end - start) + .count(); + printStats(time, n1, n2, n3, num_iterations); + + // Verify result with the CPU serial version + if (verify) { + VerifyResult(prev, next, vel, coeff, n1, n2, n3, num_iterations); + } + + delete[] prev; + delete[] next; + delete[] vel; + + return 0; +} diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/5_GPU_optimized.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/5_GPU_optimized.cpp new file mode 100644 index 0000000000..8386e61caa --- /dev/null +++ 
b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/5_GPU_optimized.cpp @@ -0,0 +1,264 @@ +//============================================================== +// Copyright 2022 Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= + +#include +#include +#include +#include + +#include "Utils.hpp" + +using namespace sycl; + +void iso3dfd(queue& q, float* next, float* prev, float* vel, float* coeff, + const size_t n1, const size_t n2, const size_t n3, + const size_t nreps, const size_t kernel_iterations, + const size_t n2_workGroupSize, const size_t n3_workGroupSize) { + // Create 3D SYCL range for kernels which not include HALO and slices first dimension + range<3> kernel_range((n1 - 2 * kHalfLength) / kernel_iterations, + (n2 - 2 * kHalfLength), + (n3 - 2 * kHalfLength)); + // Create 3D SYCL range for work group size + range<3> workGroupSize(1, n2_workGroupSize, n3_workGroupSize); + // Create 1D SYCL range for buffers which include HALO + range<1> buffer_range(n1 * n2 * n3); + // Create buffers using SYCL class buffer + buffer next_buf(next, buffer_range); + buffer prev_buf(prev, buffer_range); + buffer vel_buf(vel, buffer_range); + buffer coeff_buf(coeff, range(kHalfLength + 1)); + + for (auto it = 0; it < nreps; it++) { + // Submit command group for execution + q.submit([&](handler& h) { + // Create accessors + accessor next_acc(next_buf, h); + accessor prev_acc(prev_buf, h); + accessor vel_acc(vel_buf, h, read_only); + accessor coeff_acc(coeff_buf, h, read_only); + + // Create 1D SYCL range for Shared Local Memory(SLM) which includes HALO + range<1> local_range((n2_workGroupSize + 2 * kHalfLength) * + (n3_workGroupSize + 2 * kHalfLength)); + // Create an accessor for SLM buffer which will contains data used + // multiple times by work group + local_accessor tab(local_range, h); + + // Send a SYCL kernel(lambda) to the device for parallel execution + // Each kernel 
runs single row slice over first dimension + h.parallel_for( + nd_range(kernel_range, workGroupSize), [=](nd_item<3> nidx) { + // Start of device code + // Add offsets to indices to exclude HALO + // Get the global and local indices + // Start and end index used in loop + int n2n3 = n2 * n3; + int l_n3 = n3_workGroupSize + 2 * kHalfLength; + int i = nidx.get_global_id(0) * kernel_iterations + kHalfLength; + int j = nidx.get_global_id(1) + kHalfLength; + int k = nidx.get_global_id(2) + kHalfLength; + int end_i = i + kernel_iterations; + int l_j = nidx.get_local_id(1) + kHalfLength; + int l_k = nidx.get_local_id(2) + kHalfLength; + + // Calculate global and local() linear index for each cell + int idx = i * n2n3 + j * n3 + k; + int l_idx = l_j * l_n3 + l_k; + + + // Create arrays to store data used multiple times + // Local copy of coeff buffer and continous values over first dimension which are + // used to calculate stencil front and back arrays are used to + // ensure the values over first dimension are read once, shifted in` + // these array and re-used multiple times before being discarded + // This is an optimization technique to enable data-reuse and + // improve overall FLOPS to BYTES read ratio + float coeff[kHalfLength + 1]; + float front[kHalfLength + 1]; + float back[kHalfLength]; + + // Fill local arrays, front[0] contains current cell value + for (int x = 0; x <= kHalfLength; x++) { + coeff[x] = coeff_acc[x]; + front[x] = prev_acc[idx + n2n3 * x]; + } + for (int x = 1; x <= kHalfLength; x++) { + back[x-1] = prev_acc[idx - n2n3 * x]; + } + + // Check if work item should copy HALO data + bool copy_halo_z = + (nidx.get_local_id(0) < kHalfLength) ? true : false; + bool copy_halo_x = + (nidx.get_local_id(2) < kHalfLength) ? 
true : false; + + // Iterate over first dimension excluding HALO + for (; i < end_i; i++) { + + // Copy HALO data to SLM if needed + if (copy_halo_x) { + tab[l_idx - kHalfLength] = prev_acc[idx - kHalfLength]; + tab[l_idx + n3_workGroupSize] = + prev_acc[idx + n3_workGroupSize]; + } + if (copy_halo_z) { + tab[l_idx - kHalfLength * l_n3] = prev_acc[idx - n3 * kHalfLength]; + tab[l_idx + n2_workGroupSize * l_n3] = + prev_acc[idx + n3 * n2_workGroupSize]; + } + + // Copy current data to SLM + tab[l_idx] = front[0]; + + // SYCL Basic synchronization (barrier function) + // Force synchronization within a work-group + // using barrier function to ensure + // all the work-items have completed reading into the SLM buffer + nidx.barrier(access::fence_space::local_space); + + // Calculate values for each cell + float value = front[0] * coeff[0]; + #pragma unroll(8) + for (int x = 1; x <= kHalfLength; x++) { + value += coeff[x] * + (tab[l_idx + x] + tab[l_idx - x] + + tab[l_idx + l_n3 * x] + tab[l_idx - l_n3 * x] + + front[x] + back[x - 1]); + } + next_acc[idx] = 2.0f * front[0] - next_acc[idx] + + value * vel_acc[idx]; + + // Increase linear index, jump to the next cell in first dimension + idx += n2n3; + + // Shift values in front and back arrays + for (auto x = kHalfLength - 1; x > 0; x--) { + back[x] = back[x - 1]; + } + back[0] = front[0]; + + for (auto x = 0; x < kHalfLength; x++) { + front[x] = front[x + 1]; + } + front[kHalfLength] = prev_acc[idx + kHalfLength * n2n3]; + + // SYCL Basic synchronization (barrier function) + // Force synchronization within a work-group + // using barrier function to ensure that SLM buffers + // are not overwritten by next set of work-items + // (highly unlikely but not impossible) + nidx.barrier(access::fence_space::local_space); + + } + // End of device code + }); + }); + + // Swap the buffers for always having current values in prev buffer + std::swap(next_buf, prev_buf); + } +} + + +int main(int argc, char* argv[]) { + // Arrays used 
to update the wavefield + float* prev; + float* next; + // Array to store wave velocity + float* vel; + + // Variables to store size of grids, number of simulation iterations, + // work group size and kernel iterations (size of slice over Y) + size_t n1, n2, n3; + size_t num_iterations; + size_t kernel_iterations; + size_t n2_workGroupSize, n3_workGroupSize; + + // Flag to verify results with CPU version + bool verify = false; + + if (argc < 8) { + Usage(argv[0], true); + return 1; + } + + try { + // Parse command line arguments and increase them by HALO + n1 = std::stoi(argv[1]) + (2 * kHalfLength); + n2 = std::stoi(argv[2]) + (2 * kHalfLength); + n3 = std::stoi(argv[3]) + (2 * kHalfLength); + num_iterations = std::stoi(argv[4]); + kernel_iterations = std::stoi(argv[5]); + n2_workGroupSize = std::stoi(argv[6]); + n3_workGroupSize = std::stoi(argv[7]); + if (argc > 8) verify = true; + } catch (...) { + Usage(argv[0], true); + return 1; + } + + // Validate input sizes for the grid + if (ValidateInput(n1, n2, n3, num_iterations, kernel_iterations, + n2_workGroupSize, n3_workGroupSize)) { + Usage(argv[0], true); + return 1; + } + + // Create queue with default selector and in order property + queue q(default_selector_v, {property::queue::in_order()}); + + if (CheckWorkGroupSize(q, n2_workGroupSize, n3_workGroupSize)) { + Usage(argv[0], true); + return 1; + } + + std::cout << " Running GPU optimized version\n"; + printTargetInfo(q); + + // Compute the total size of grid + size_t nsize = n1 * n2 * n3; + + prev = new float[nsize]; + next = new float[nsize]; + vel = new float[nsize]; + + // Compute coefficients to be used in wavefield update + float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1, + +7.572087e-2, -1.76767677e-2, +3.480962e-3, + -5.180005e-4, +5.074287e-5, -2.42812e-6}; + + // Apply the DX, DY and DZ to coefficients + coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz); + for (auto i = 1; i <= kHalfLength; i++) { + coeff[i] = coeff[i] / (dxyz * 
dxyz); + } + + // Initialize arrays and introduce initial conditions (source) + initialize(prev, next, vel, n1, n2, n3); + + auto start = std::chrono::steady_clock::now(); + + // Invoke the driver function to perform 3D wave propagation offloaded to + // the device + iso3dfd(q, next, prev, vel, coeff, n1, n2, n3, num_iterations, + kernel_iterations, n2_workGroupSize, n3_workGroupSize); + + auto end = std::chrono::steady_clock::now(); + auto time = std::chrono::duration_cast(end - start) + .count(); + printStats(time, n1, n2, n3, num_iterations); + + // Verify result with the CPU serial version + if (verify) { + VerifyResult(prev, next, vel, coeff, n1, n2, n3, num_iterations); + } + + delete[] prev; + delete[] next; + delete[] vel; + + return 0; +} \ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/CMakeLists.txt b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/CMakeLists.txt new file mode 100644 index 0000000000..675ce6be73 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/CMakeLists.txt @@ -0,0 +1,33 @@ +set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -fsycl --std=c++17") +# Set default build type to RelWithDebInfo if not specified +if (NOT CMAKE_BUILD_TYPE) + message (STATUS "Default CMAKE_BUILD_TYPE not set using Release with Debug Info") + set (CMAKE_BUILD_TYPE "RelWithDebInfo" CACHE + STRING "Choose the type of build, options are: None Debug Release RelWithDebInfo MinSizeRel" + FORCE) +endif() + +set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS}") + +add_executable(1_CPU_only 1_CPU_only.cpp) +add_executable(2_GPU_basic 2_GPU_basic.cpp) +add_executable(3_GPU_linear 3_GPU_linear.cpp) +add_executable(3_GPU_linear_USM 3_GPU_linear_USM.cpp) +add_executable(4_GPU_optimized 4_GPU_optimized.cpp) + +target_link_libraries(1_CPU_only OpenCL sycl) +target_link_libraries(2_GPU_basic OpenCL sycl) 
+target_link_libraries(3_GPU_linear OpenCL sycl)
+target_link_libraries(3_GPU_linear_USM OpenCL sycl)
+target_link_libraries(4_GPU_optimized OpenCL sycl)
+
+add_custom_target(run_all 1_CPU_only 1024 1024 1024 100
+    COMMAND 2_GPU_basic 1024 1024 1024 100
+    COMMAND 3_GPU_linear 1024 1024 1024 100
+    COMMAND 3_GPU_linear_USM 1024 1024 1024 100
+    COMMAND 4_GPU_optimized 1024 1024 1024 32 4 8 100)
+add_custom_target(run_cpu 1_CPU_only 1024 1024 1024 100)
+add_custom_target(run_gpu_basic 2_GPU_basic 1024 1024 1024 100)
+add_custom_target(run_gpu_linear 3_GPU_linear 1024 1024 1024 100)
+add_custom_target(run_gpu_optimized 4_GPU_optimized 1024 1024 1024 32 4 8 100)
+add_custom_target(run_gpu_linear_usm 3_GPU_linear_USM 1024 1024 1024 100)
\ No newline at end of file
diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/Iso3dfd.hpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/Iso3dfd.hpp
new file mode 100644
index 0000000000..e3487fa0cf
--- /dev/null
+++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/Iso3dfd.hpp
@@ -0,0 +1,21 @@
+//==============================================================
+// Copyright © 2022 Intel Corporation
+//
+// SPDX-License-Identifier: MIT
+// =============================================================
+
+#pragma once
+
+constexpr size_t kHalfLength = 8;
+constexpr float dxyz = 50.0f;
+constexpr float dt = 0.002f;
+
+#define STENCIL_LOOKUP(ir) \
+  (coeff[ir] * ((ptr_prev[ix + ir] + ptr_prev[ix - ir]) + \
+                (ptr_prev[ix + ir * n1] + ptr_prev[ix - ir * n1]) + \
+                (ptr_prev[ix + ir * dimn1n2] + ptr_prev[ix - ir * dimn1n2])))
+
+
+#define KERNEL_STENCIL_LOOKUP(x) \
+  coeff[x] * (tab[l_idx + x] + tab[l_idx - x] + front[x] + back[x - 1] \
+              + tab[l_idx + l_n3 * x] + tab[l_idx - l_n3 * x])
\ No newline at end of file
diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/Utils.hpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/Utils.hpp
new file mode 100644
index 0000000000..98d4a6e12c
--- /dev/null
+++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/03_ISO3DFD_GPU_Linear/src/Utils.hpp
@@ -0,0 +1,259 @@
+//==============================================================
+// Copyright © 2022 Intel Corporation
+//
+// SPDX-License-Identifier: MIT
+// =============================================================
+
+#pragma once
+
+#include <fstream>
+#include <iostream>
+
+#include "Iso3dfd.hpp"
+
+void Usage(const std::string& programName, bool usedNd_ranges = false) {
+  std::cout << "--------------------------------------\n";
+  std::cout << " Incorrect parameters \n";
+  std::cout << " Usage: ";
+  std::cout << programName << " n1 n2 n3 Iterations";
+
+  if (usedNd_ranges) std::cout << " kernel_iterations n2_WGS n3_WGS";
+
+  std::cout << " [verify]\n\n";
+  std::cout << " n1 n2 n3 : Grid sizes for the stencil\n";
+  std::cout << " Iterations : No. of timesteps.\n";
+
+  if (usedNd_ranges) {
+    std::cout
+        << " kernel_iterations : No. 
of cells calculated by one kernel\n"; + std::cout << " n2_WGS n3_WGS : n2 and n3 work group sizes\n"; + } + std::cout + << " [verify] : Optional: Compare results with CPU version\n"; + std::cout << "--------------------------------------\n"; + std::cout << "--------------------------------------\n"; +} + +bool ValidateInput(size_t n1, size_t n2, size_t n3, size_t num_iterations, + size_t kernel_iterations = -1, size_t n2_WGS = kHalfLength, + size_t n3_WGS = kHalfLength) { + if ((n1 < kHalfLength) || (n2 < kHalfLength) || (n3 < kHalfLength) || + (n2_WGS < kHalfLength) || (n3_WGS < kHalfLength)) { + std::cout << "--------------------------------------\n"; + std::cout << " Invalid grid size : n1, n2, n3, n2_WGS, n3_WGS should be " + "greater than " + << kHalfLength << "\n"; + return true; + } + + if ((n2 < n2_WGS) || (n3 < n3_WGS)) { + std::cout << "--------------------------------------\n"; + std::cout << " Invalid work group size : n2 should be greater than n2_WGS " + "and n3 greater than n3_WGS\n"; + return true; + } + + if (((n2 - 2 * kHalfLength) % n2_WGS) && kernel_iterations != -1) { + std::cout << "--------------------------------------\n"; + std::cout << " ERROR: Invalid Grid Size: n2 should be multiple of n2_WGS - " + << n2_WGS << "\n"; + return true; + } + if (((n3 - 2 * kHalfLength) % n3_WGS) && kernel_iterations != -1) { + std::cout << "--------------------------------------\n"; + std::cout << " ERROR: Invalid Grid Size: n3 should be multiple of n3_WGS - " + << n3_WGS << "\n"; + return true; + } + if (((n1 - 2 * kHalfLength) % kernel_iterations) && kernel_iterations != -1) { + std::cout << "--------------------------------------\n"; + std::cout << " ERROR: Invalid Grid Size: n1 should be multiple of " + "kernel_iterations - " + << kernel_iterations << "\n"; + return true; + } + + return false; +} + +bool CheckWorkGroupSize(sycl::queue& q, unsigned int n2_WGS, + unsigned int n3_WGS) { + auto device = q.get_device(); + auto max_block_size = + 
device.get_info<sycl::info::device::max_work_group_size>();
+
+  if ((max_block_size > 1) && (n2_WGS * n3_WGS > max_block_size)) {
+    std::cout << "ERROR: Invalid block sizes: n2_WGS * n3_WGS should be "
+                 "less than or equal to "
+              << max_block_size << "\n";
+    return true;
+  }
+
+  return false;
+}
+
+void printTargetInfo(sycl::queue& q) {
+  auto device = q.get_device();
+  auto max_block_size =
+      device.get_info<sycl::info::device::max_work_group_size>();
+
+  auto max_exec_unit_count =
+      device.get_info<sycl::info::device::max_compute_units>();
+
+  std::cout << " Running on " << device.get_info<sycl::info::device::name>()
+            << "\n";
+  std::cout << " The Device Max Work Group Size is : " << max_block_size
+            << "\n";
+  std::cout << " The Device Max EUCount is : " << max_exec_unit_count << "\n";
+}
+
+void initialize(float* ptr_prev, float* ptr_next, float* ptr_vel, size_t n1,
+                size_t n2, size_t n3) {
+  auto dim2 = n2 * n1;
+
+  for (auto i = 0; i < n3; i++) {
+    for (auto j = 0; j < n2; j++) {
+      auto offset = i * dim2 + j * n1;
+
+      for (auto k = 0; k < n1; k++) {
+        ptr_prev[offset + k] = 0.0f;
+        ptr_next[offset + k] = 0.0f;
+        ptr_vel[offset + k] =
+            2250000.0f * dt * dt;  // Integration of the v*v and dt*dt here
+      }
+    }
+  }
+  // Then we add a source
+  float val = 1.f;
+  for (auto s = 5; s >= 0; s--) {
+    for (auto i = n3 / 2 - s; i < n3 / 2 + s; i++) {
+      for (auto j = n2 / 4 - s; j < n2 / 4 + s; j++) {
+        auto offset = i * dim2 + j * n1;
+        for (auto k = n1 / 4 - s; k < n1 / 4 + s; k++) {
+          ptr_prev[offset + k] = val;
+        }
+      }
+    }
+    val *= 10;
+  }
+}
+
+void printStats(double time, size_t n1, size_t n2, size_t n3,
+                size_t num_iterations) {
+  float throughput_mpoints = 0.0f, mflops = 0.0f, normalized_time = 0.0f;
+  double mbytes = 0.0f;
+
+  normalized_time = (double)time / num_iterations;
+  throughput_mpoints = ((n1 - 2 * kHalfLength) * (n2 - 2 * kHalfLength) *
+                        (n3 - 2 * kHalfLength)) /
+                       (normalized_time * 1e3f);
+  mflops = (7.0f * kHalfLength + 5.0f) * throughput_mpoints;
+  mbytes = 12.0f * throughput_mpoints;
+
+  std::cout << "--------------------------------------\n";
+  std::cout << "time : " << time / 1e3f << " 
secs\n"; + std::cout << "throughput : " << throughput_mpoints << " Mpts/s\n"; + std::cout << "flops : " << mflops / 1e3f << " GFlops\n"; + std::cout << "bytes : " << mbytes / 1e3f << " GBytes/s\n"; + std::cout << "\n--------------------------------------\n"; + std::cout << "\n--------------------------------------\n"; +} + +bool WithinEpsilon(float* output, float* reference, const size_t dim_x, + const size_t dim_y, const size_t dim_z, + const unsigned int radius, const int zadjust = 0, + const float delta = 0.01f) { + std::ofstream error_file; + error_file.open("error_diff.txt"); + + bool error = false; + double norm2 = 0; + + for (size_t iz = 0; iz < dim_z; iz++) { + for (size_t iy = 0; iy < dim_y; iy++) { + for (size_t ix = 0; ix < dim_x; ix++) { + if (ix >= radius && ix < (dim_x - radius) && iy >= radius && + iy < (dim_y - radius) && iz >= radius && + iz < (dim_z - radius + zadjust)) { + float difference = fabsf(*reference - *output); + norm2 += difference * difference; + if (difference > delta) { + error = true; + error_file << " ERROR: " << ix << ", " << iy << ", " << iz << " " + << *output << " instead of " << *reference + << " (|e|=" << difference << ")\n"; + } + } + ++output; + ++reference; + } + } + } + + error_file.close(); + norm2 = sqrt(norm2); + if (error) std::cout << "error (Euclidean norm): " << norm2 << "\n"; + return error; +} + +void inline iso3dfdCPUIteration(float* ptr_next_base, float* ptr_prev_base, + float* ptr_vel_base, float* coeff, + const size_t n1, const size_t n2, + const size_t n3) { + auto dimn1n2 = n1 * n2; + + auto n3_end = n3 - kHalfLength; + auto n2_end = n2 - kHalfLength; + auto n1_end = n1 - kHalfLength; + + for (auto iz = kHalfLength; iz < n3_end; iz++) { + for (auto iy = kHalfLength; iy < n2_end; iy++) { + float* ptr_next = ptr_next_base + iz * dimn1n2 + iy * n1; + float* ptr_prev = ptr_prev_base + iz * dimn1n2 + iy * n1; + float* ptr_vel = ptr_vel_base + iz * dimn1n2 + iy * n1; + + for (auto ix = kHalfLength; ix < n1_end; 
ix++) {
+        float value = ptr_prev[ix] * coeff[0];
+        value += STENCIL_LOOKUP(1);
+        value += STENCIL_LOOKUP(2);
+        value += STENCIL_LOOKUP(3);
+        value += STENCIL_LOOKUP(4);
+        value += STENCIL_LOOKUP(5);
+        value += STENCIL_LOOKUP(6);
+        value += STENCIL_LOOKUP(7);
+        value += STENCIL_LOOKUP(8);
+
+        ptr_next[ix] = 2.0f * ptr_prev[ix] - ptr_next[ix] + value * ptr_vel[ix];
+      }
+    }
+  }
+}
+
+void CalculateReference(float* next, float* prev, float* vel, float* coeff,
+                        const size_t n1, const size_t n2, const size_t n3,
+                        const size_t nreps) {
+  for (auto it = 0; it < nreps; it += 1) {
+    iso3dfdCPUIteration(next, prev, vel, coeff, n1, n2, n3);
+    std::swap(next, prev);
+  }
+}
+
+void VerifyResult(float* prev, float* next, float* vel, float* coeff,
+                  const size_t n1, const size_t n2, const size_t n3,
+                  const size_t nreps) {
+  std::cout << "Running CPU version for result comparison: ";
+  auto nsize = n1 * n2 * n3;
+  float* temp = new float[nsize];
+  memcpy(temp, prev, nsize * sizeof(float));
+  initialize(prev, next, vel, n1, n2, n3);
+  CalculateReference(next, prev, vel, coeff, n1, n2, n3, nreps);
+  bool error = WithinEpsilon(temp, prev, n1, n2, n3, kHalfLength, 0, 0.1f);
+  if (error) {
+    std::cout << "Final wavefields from SYCL device and CPU are not "
+              << "equivalent: Fail\n";
+  } else {
+    std::cout << "Final wavefields from SYCL device and CPU are equivalent:"
+              << " Success\n";
+  }
+  delete[] temp;
+}
diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/CMakeLists.txt b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/CMakeLists.txt
new file mode 100644
index 0000000000..e0bded3dae
--- /dev/null
+++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/CMakeLists.txt
@@ -0,0 +1,4 @@
+cmake_minimum_required (VERSION 3.4)
+set(CMAKE_CXX_COMPILER "icpx")
+project (Iso3DFD)
+add_subdirectory (src)
\ No newline at end of file
diff --git 
a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/img/4_iso.png b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/img/4_iso.png new file mode 100644 index 0000000000..226fcb3ed5 Binary files /dev/null and b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/img/4_iso.png differ diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/iso3dfd_gpu_optimized.ipynb b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/iso3dfd_gpu_optimized.ipynb new file mode 100644 index 0000000000..c08f9e3d16 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/iso3dfd_gpu_optimized.ipynb @@ -0,0 +1,536 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# ISO3DFD using nd_range kernel" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Learning Objectives" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n",
+    "- Understand how to further optimize the application using L1 cache reuse\n",
+    "- Run roofline analysis and the VTune reports again to gauge the results\n",
+    "
"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Iso3DFD using nd_range kernel"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In the previous activity, we used Intel® Advisor roofline analysis to determine that the application is memory bound: the kernels show little cache reuse, and performance is limited by L3 memory traffic, which is all about re-using data.\n",
+    "\n",
+    "In this notebook, we'll address the L3 memory-bound kernels by using dedicated cache reuse memory.\n",
+    "\n",
+    "The tuning puts more work in each local work group, which optimizes loading neighboring stencil points from the fast L1 cache.\n",
+    "\n",
+    "To do this we change the kernel to use nd_range; each work-item no longer calculates a single cell but iterates, so that 1024 x 1 x 1 grid points are scheduled on each SIMD16 core and all 1024 points share an L1 cache. In the previous activity we scheduled 16 x 1 x 1 grid points on each SIMD16 core, so only 16 points shared the L1 cache.\n",
+    "\n",
+    "We can change the parameters passed to the application to find the best load for each work group. 
\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Optimizing using nd_range kernel\n",
+    "The 4_GPU_optimized version of the sample addresses the memory constraints by reusing data from the L1 cache: it schedules 1024 x 1 x 1 grid points on each SIMD16 core so that all 1024 points share an L1 cache.\n",
+    "\n",
+    "```\n",
+    "// Create USM objects \n",
+    "  auto next = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);\n",
+    "  next += (16 - kHalfLength);\n",
+    "  q.memcpy(next, ptr_next, sizeof(float)*grid_size);\n",
+    "  auto prev = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);\n",
+    "  prev += (16 - kHalfLength);\n",
+    "  q.memcpy(prev, ptr_prev, sizeof(float)*grid_size);\n",
+    "  auto vel = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);\n",
+    "  vel += (16 - kHalfLength);\n",
+    "  q.memcpy(vel, ptr_vel, sizeof(float)*grid_size);\n",
+    "  auto coeff = sycl::aligned_alloc_device<float>(64, kHalfLength+1, q);\n",
+    "  q.memcpy(coeff, ptr_coeff, sizeof(float)*(kHalfLength+1));\n",
+    "  q.wait();\n",
+    "```\n",
+    "\n",
+    "* The following integer expression rounds N up to the next multiple of M. The global nd_range must be an integer multiple of the local nd_range, so the global nd_range is rounded up to the next multiple of the local nd_range. A conditional statement is added to ensure any extra work-items do no work.\n",
+    "\n",
+    "```\n",
+    "// Create 1D SYCL range for buffers which include HALO\n",
+    "range<1> buffer_range(n1 * n2 * n3);\n",
+    "auto global_nd_range = range<3>((n3-2*kHalfLength+n3_block-1)/n3_block*n3_block,(n2-2*kHalfLength+n2_block-1)/n2_block*n2_block,n1_block);\n",
+    "\n",
+    "```\n",
+    "* Change parallel_for to use nd_range. 
Here each work-item is doing more work reading from faster L1 cache.\n", + "\n", + "```\n", + "q.submit([&](auto &h) { \n", + " h.parallel_for(\n", + " nd_range(global_nd_range, local_nd_range), [=](auto item) \n", + " {\n", + " const int iz = kHalfLength + item.get_global_id(0);\n", + " const int iy = kHalfLength + item.get_global_id(1);\n", + " if (iz < n3 - kHalfLength && iy < n2 - kHalfLength)\n", + " for (int ix = kHalfLength+item.get_global_id(2); ix < n1 - kHalfLength; ix += n1_block)\n", + " {\n", + " auto gid = ix + iy*nx + iz*nxy;\n", + " float *pgid = prev+gid;\n", + " auto value = coeff[0] * pgid[0];\n", + "#pragma unroll(kHalfLength)\n", + " for (auto iter = 1; iter <= kHalfLength; iter++)\n", + " value += coeff[iter]*(pgid[iter*nxy] + pgid[-iter*nxy] + pgid[iter*nx] + pgid[-iter*nx] + pgid[iter] + pgid[-iter]);\n", + " next[gid] = 2.0f*pgid[0] - next[gid] + value*vel[gid];\n", + " }\n", + " }); \n", + " }).wait();\n", + " std::swap(next, prev);\n", + "\n", + "```\n", + "We will run roofline analysis and the VTune reports again to gauge the results.\n", + "\n", + "The SYCL code below shows Iso3dFD GPU code using SYCL with Index optimizations: Inspect code, there are no modifications necessary:\n", + "1. Inspect the code cell below and click run ▶ to save the code to file\n", + "2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code." 
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%writefile src/4_GPU_optimized.cpp\n",
+    "//==============================================================\n",
+    "// Copyright © Intel Corporation\n",
+    "//\n",
+    "// SPDX-License-Identifier: MIT\n",
+    "// =============================================================\n",
+    "\n",
+    "#include <sycl/sycl.hpp>\n",
+    "#include <chrono>\n",
+    "#include <fstream>\n",
+    "#include <string>\n",
+    "\n",
+    "#include \"Utils.hpp\"\n",
+    "\n",
+    "using namespace sycl;\n",
+    "\n",
+    "void iso3dfd(queue& q, float* ptr_next, float* ptr_prev, float* ptr_vel, float* ptr_coeff,\n",
+    "             const size_t n1, const size_t n2, const size_t n3, size_t n1_block, size_t n2_block, size_t n3_block,\n",
+    "             const size_t nIterations) {\n",
+    "  auto nx = n1;\n",
+    "  auto nxy = n1*n2;\n",
+    "  auto grid_size = nxy*n3;\n",
+    "\n",
+    "  auto next = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);\n",
+    "  next += (16 - kHalfLength);\n",
+    "  q.memcpy(next, ptr_next, sizeof(float)*grid_size);\n",
+    "  auto prev = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);\n",
+    "  prev += (16 - kHalfLength);\n",
+    "  q.memcpy(prev, ptr_prev, sizeof(float)*grid_size);\n",
+    "  auto vel = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);\n",
+    "  vel += (16 - kHalfLength);\n",
+    "  q.memcpy(vel, ptr_vel, sizeof(float)*grid_size);\n",
+    "  auto coeff = sycl::aligned_alloc_device<float>(64, kHalfLength+1, q);\n",
+    "  q.memcpy(coeff, ptr_coeff, sizeof(float)*(kHalfLength+1));\n",
+    "  q.wait();\n",
+    "\n",
+    "  auto local_nd_range = range<3>(n3_block,n2_block,n1_block);\n",
+    "  auto global_nd_range = range<3>((n3-2*kHalfLength+n3_block-1)/n3_block*n3_block,(n2-2*kHalfLength+n2_block-1)/n2_block*n2_block,n1_block);\n",
+    "\n",
+    "  for (auto i = 0; i < nIterations; i += 1) {\n",
+    "    q.submit([&](auto &h) {\n",
+    "      h.parallel_for(\n",
+    "          nd_range<3>(global_nd_range, 
local_nd_range), [=](auto item)\n",
+    "          {\n",
+    "            const int iz = kHalfLength + item.get_global_id(0);\n",
+    "            const int iy = kHalfLength + item.get_global_id(1);\n",
+    "            if (iz < n3 - kHalfLength && iy < n2 - kHalfLength)\n",
+    "              for (int ix = kHalfLength+item.get_global_id(2); ix < n1 - kHalfLength; ix += n1_block)\n",
+    "              {\n",
+    "                auto gid = ix + iy*nx + iz*nxy;\n",
+    "                float *pgid = prev+gid;\n",
+    "                auto value = coeff[0] * pgid[0];\n",
+    "#pragma unroll(kHalfLength)\n",
+    "                for (auto iter = 1; iter <= kHalfLength; iter++)\n",
+    "                  value += coeff[iter]*(pgid[iter*nxy] + pgid[-iter*nxy] + pgid[iter*nx] + pgid[-iter*nx] + pgid[iter] + pgid[-iter]);\n",
+    "                next[gid] = 2.0f*pgid[0] - next[gid] + value*vel[gid];\n",
+    "              }\n",
+    "          });\n",
+    "    }).wait();\n",
+    "    std::swap(next, prev);\n",
+    "  }\n",
+    "  q.memcpy(ptr_prev, prev, sizeof(float)*grid_size);\n",
+    "  q.wait();\n",
+    "\n",
+    "  sycl::free(next - (16 - kHalfLength),q);\n",
+    "  sycl::free(prev - (16 - kHalfLength),q);\n",
+    "  sycl::free(vel - (16 - kHalfLength),q);\n",
+    "  sycl::free(coeff,q);\n",
+    "}\n",
+    "\n",
+    "int main(int argc, char* argv[]) {\n",
+    "  // Arrays used to update the wavefield\n",
+    "  float* prev;\n",
+    "  float* next;\n",
+    "  // Array to store wave velocity\n",
+    "  float* vel;\n",
+    "\n",
+    "  // Variables to store size of grids and number of simulation iterations\n",
+    "  size_t n1, n2, n3;\n",
+    "  size_t n1_block, n2_block, n3_block;\n",
+    "  size_t num_iterations;\n",
+    "\n",
+    "  // Flag to verify results with CPU version\n",
+    "  bool verify = false;\n",
+    "\n",
+    "  // The program needs 7 arguments: 3 grid sizes, 3 block sizes, iterations\n",
+    "  if (argc < 8) {\n",
+    "    Usage(argv[0]);\n",
+    "    return 1;\n",
+    "  }\n",
+    "\n",
+    "  try {\n",
+    "    // Parse command line arguments and increase them by HALO\n",
+    "    n1 = std::stoi(argv[1]) + (2 * kHalfLength);\n",
+    "    n2 = std::stoi(argv[2]) + (2 * kHalfLength);\n",
+    "    n3 = std::stoi(argv[3]) + (2 * kHalfLength);\n",
+    "    n1_block = std::stoi(argv[4]);\n",
+    "    n2_block = std::stoi(argv[5]);\n",
+    "    n3_block = std::stoi(argv[6]);\n",
+    "    
num_iterations = std::stoi(argv[7]);\n",
+    "  } catch (...) {\n",
+    "    Usage(argv[0]);\n",
+    "    return 1;\n",
+    "  }\n",
+    "\n",
+    "  // Validate input sizes for the grid\n",
+    "  if (ValidateInput(n1, n2, n3, num_iterations)) {\n",
+    "    Usage(argv[0]);\n",
+    "    return 1;\n",
+    "  }\n",
+    "\n",
+    "  // Create queue and print target info with default selector and in order\n",
+    "  // property\n",
+    "  queue q(default_selector_v, {property::queue::in_order()});\n",
+    "  std::cout << \" Running nd_range GPU version\\n\";\n",
+    "  printTargetInfo(q);\n",
+    "\n",
+    "  // Compute the total size of grid\n",
+    "  size_t nsize = n1 * n2 * n3;\n",
+    "\n",
+    "  prev = new float[nsize];\n",
+    "  next = new float[nsize];\n",
+    "  vel = new float[nsize];\n",
+    "\n",
+    "  // Compute coefficients to be used in wavefield update\n",
+    "  float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1,\n",
+    "                                  +7.572087e-2, -1.76767677e-2, +3.480962e-3,\n",
+    "                                  -5.180005e-4, +5.074287e-5, -2.42812e-6};\n",
+    "\n",
+    "  // Apply the DX, DY and DZ to coefficients\n",
+    "  coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz);\n",
+    "  for (auto i = 1; i <= kHalfLength; i++) {\n",
+    "    coeff[i] = coeff[i] / (dxyz * dxyz);\n",
+    "  }\n",
+    "\n",
+    "  // Initialize arrays and introduce initial conditions (source)\n",
+    "  initialize(prev, next, vel, n1, n2, n3);\n",
+    "\n",
+    "  auto start = std::chrono::steady_clock::now();\n",
+    "\n",
+    "  // Invoke the driver function to perform 3D wave propagation offloaded to\n",
+    "  // the device\n",
+    "  iso3dfd(q, next, prev, vel, coeff, n1, n2, n3, n1_block, n2_block, n3_block, num_iterations);\n",
+    "\n",
+    "  auto end = std::chrono::steady_clock::now();\n",
+    "  auto time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start)\n",
+    "                  .count();\n",
+    "  printStats(time, n1, n2, n3, num_iterations);\n",
+    "\n",
+    "  delete[] prev;\n",
+    "  delete[] next;\n",
+    "  delete[] vel;\n",
+    "\n",
+    "  return 0;\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Once the 
application is created, we can run it from the command line with a few parameters, as follows:\n",
+    "\n",
+    "    src/4_GPU_optimized 1024 1024 1024 32 8 4 100\n",
+    "\n",
+    "* `src/4_GPU_optimized` is the binary\n",
+    "* `1024 1024 1024` are the sizes of the three grid dimensions; increasing them results in more computation time\n",
+    "* `32 8 4` are the work-group block sizes for the three dimensions\n",
+    "* `100` is the number of time steps\n",
+    "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Build and Run\n", + "Select the cell below and click run ▶ to compile and execute the code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "! chmod 755 q; chmod 755 run_gpu_optimized.sh;if [ -x \"$(command -v qsub)\" ]; then ./q run_gpu_optimized.sh; else ./run_gpu_optimized.sh; fi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## ISO3DFD GPU Optimizations" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "* We started from a code version running with standard C++ on the CPU.\n", + "* Using Intel® Offload Advisor, we determined which loop was a good candidate for offload and then using SYCL we worked on a solution to make our code run on the GPU but also on the CPU.\n", + "* We identifed the application is bound by Integer opearations.\n", + "* We fixed the indexing to make the code more optimized with reduced INT operations\n", + "* we are going to check how the implementation of L1 cache reusage works\n", + "* The next step, is to to run the Roofline Model and VTune to\n", + " * Check the current optimizations using L1 cache reusage. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "### Running the GPU Roofline Analysis\n", + "With the offload implemented in 4_GPU_optimized using SYCL, we'll want to run roofline analysis to see the improvements we made to the application and look for more areas where there is room for performance optimization.\n", + "```\n", + "advisor --collect=roofline --profile-gpu --project-dir=./advi_results -- ./myApplication \n", + "```\n", + "The iso3DFD GPU optimized code can be run using\n", + "```\n", + "advisor --collect=roofline --profile-gpu --project-dir=./../advisor/4_gpu -- ./build/src/4_GPU_optimized 1024 1024 1024 32 8 4 100\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Build and Run\n", + "Select the cell below and click run ▶ to compile and execute the code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "! chmod 755 q; chmod 755 run_gpu_roofline_advisor.sh;if [ -x \"$(command -v qsub)\" ]; then ./q run_gpu_roofline_advisor.sh; else ./run_gpu_roofline_advisor.sh; fi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Analyzing the HTML report" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "As noted in the below roofline model we can observe that,\n", + "\n", + "* We can observe it is bounded by HBM memory\n", + "* Still lesser INT operations.\n", + "* High HBM traffic\n", + "* Higher Threading occupancy\n", + "\n", + "\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "### Roofline Analysis report overview\n", + "To display the report, just execute the following frame. In practice, the report will be available in the folder you defined as --out-dir in the previous script. 
\n", + "\n", + "[View the report in HTML](reports/advisor-report_linear.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import IFrame\n", + "display(IFrame(src='reports/advisor-report.html', width=1024, height=768))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Generating VTune reports\n", + "Below exercises we use VTune™ analyzer as a way to see what is going on with each implementation. The information was the high-level hotspot generated from the collection and rendered in an HTML iframe. Depending on the options chosen, many of the VTune analyzer's performance collections can be rendered via HTML pages. The below vtune scripts collect GPU offload and GPU hotspots information.\n", + "\n", + "#### Learn more about VTune\n", + "​\n", + "There is extensive training on VTune, click [here](https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/vtune-profiler.html#gs.2xmez3) to get deep dive training.\n", + "\n", + "```\n", + "vtune -run-pass-thru=--no-altstack -collect=gpu-offload -result-dir=vtune_dir -- ./build/src/3_GPU_linear 256 256 256 100\n", + "```\n", + "\n", + "```\n", + "vtune -run-pass-thru=--no-altstack -collect=gpu-hotspots -result-dir=vtune_dir_hotspots -- ./build/src/3_GPU_linear 256 256 256 100\n", + "```\n", + "\n", + "```\n", + "vtune -report summary -result-dir vtune_dir -format html -report-output ./reports/output_offload.html\n", + "```\n", + "\n", + "```\n", + "vtune -report summary -result-dir vtune_dir_hotspots -format html -report-output ./reports/output_hotspots.html\n", + "```\n", + "\n", + "[View the report in HTML](reports/output_offload_linear.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import IFrame\n", + "display(IFrame(src='reports/output_offload_linear.html', width=1024, 
height=768))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[View the report in HTML](reports/output_hotspots_linear.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import IFrame\n", + "display(IFrame(src='reports/output_hotspots_linear.html', width=1024, height=768))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Build and Run\n", + "Select the cell below and click run ▶ to compile and execute the code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "! chmod 755 q; chmod 755 run_gpu_linear_vtune.sh;if [ -x \"$(command -v qsub)\" ]; then ./q run_gpu_linear_vtune.sh; else ./run_gpu_linear_vtune.sh; fi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary\n", + "* We started from a code version running with standard C++ on the CPU. 
\n",
+    "* Using Intel® Offload Advisor, we determined which loop was a good candidate for offload.\n",
+    "* Using SYCL, we worked on a solution to make our code run on the GPU as well as the CPU.\n",
+    "* In the first iteration, we identified that the application was bound by integer operations and fixed the indexing to reduce them.\n",
+    "* In the last step, we tuned the kernel by adding more work to each local work-group, which optimizes loading neighboring stencil points from the fast L1 cache.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  },
+  "widgets": {
+   "application/vnd.jupyter.widget-state+json": {
+    "state": {},
+    "version_major": 2,
+    "version_minor": 0
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/q b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/q
new file mode 100644
index 0000000000..9bbad910d7
--- /dev/null
+++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/q
@@ -0,0 +1,49 @@
+#!/bin/bash
+#==========================================
+# Copyright © Intel Corporation
+#
+# SPDX-License-Identifier: MIT
+#==========================================
+# Script to submit job in Intel(R) DevCloud
+# Version: 0.71
+#==========================================
+if [ -z "$1" ]; then
+    echo "Missing script argument, Usage: ./q run.sh"
+elif [ ! 
-f "$1" ]; then + echo "File $1 does not exist" +else + echo "Job has been submitted to Intel(R) DevCloud and will execute soon." + echo "" + script=$1 + # Remove old output files + rm *.sh.* > /dev/null 2>&1 + # Submit job using qsub + qsub_id=`qsub -l nodes=1:gpu:ppn=2 -d . $script` + job_id="$(cut -d'.' -f1 <<<"$qsub_id")" + # Print qstat output + qstat + # Wait for output file to be generated and display + echo "" + echo -ne "Waiting for Output " + until [ -f $script.o$job_id ]; do + sleep 1 + echo -ne "█" + ((timeout++)) + # Timeout if no output file generated within 60 seconds + if [ $timeout == 70 ]; then + echo "" + echo "" + echo "TimeOut 60 seconds: Job is still queued for execution, check for output file later ($script.o$job_id)" + echo "" + break + fi + done + # Print output and error file content if exist + if [ -n "$(find -name '*.sh.o'$job_id)" ]; then + echo " Done⬇" + cat $script.o$job_id + cat $script.e$job_id + echo "Job Completed in $timeout seconds." + rm *.sh.*$job_id > /dev/null 2>&1 + fi +fi diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/reports/advisor-report.html b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/reports/advisor-report.html new file mode 100644 index 0000000000..4790e269ab --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/reports/advisor-report.html @@ -0,0 +1,2 @@ +Intel Advisor Report
\ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/reports/output_hotspots_linear.html b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/reports/output_hotspots_linear.html new file mode 100644 index 0000000000..a7ff93f2c7 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/reports/output_hotspots_linear.html @@ -0,0 +1,144 @@ + +
Intel® VTune Profiler 2024.0.0
  • Elapsed Time: + 13.055s
    • GPU Time: + 3.923s
  • Display controller: Intel Corporation Device 0x0bda Device Group: +
    • XVE Array Stalled/Idle: + 89.1% of Elapsed time with GPU busy
      The percentage of time when the XVEs were stalled or idle is high, which has a negative impact on compute-bound applications.
      • This section shows the XVE metrics per stack and per adapter for all the devices in this group.: +
        GPU StackGPU AdapterXVE Array Active(%)XVE Array Stalled(%)XVE Array Idle(%)
        0GPU 113.0%16.8%70.1%
        0GPU 30.0%0.0%100.0%
        0GPU 00.0%0.0%100.0%
        0GPU 20.0%0.0%100.0%
    • GPU L3 Bandwidth Bound: + 2.4% of peak value
    • Occupancy: + 24.7% of peak value
      • This section shows the computing tasks with low occupancy metric for all the devices in this group.: +
        Computing TaskTotal TimeOccupancy(%)SIMD Utilization(%)
        iso3dfd(sycl::_V1::queue&, float*, float*, float*, float*, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long)::{lambda()#1}::operator()<sycl::_V1::handler>(, signed char) const::{lambda()#1}3.917s24.7% of peak value0.0%
  • Collection and Platform Info: +
    • Application Command Line: + ./build/src/4_GPU_optimized "1024" "1024" "1024" "32" "8" "4" "100"
    • Operating System: + 5.15.0-100-generic DISTRIB_ID=Ubuntu +DISTRIB_RELEASE=22.04 +DISTRIB_CODENAME=jammy +DISTRIB_DESCRIPTION="Ubuntu 22.04.4 LTS"
    • Computer Name: + idc-beta-batch-pvc-node-04
    • Result Size: + 84.6 MB
    • Collection start time: + 16:13:16 21/03/2024 UTC
    • Collection stop time: + 16:13:29 21/03/2024 UTC
    • Collector Type: + Event-based sampling driver,User-mode sampling and tracing
    • CPU: +
      • Name: + Intel(R) Xeon(R) Processor code named Sapphirerapids
      • Frequency: + 2.000 GHz
      • Logical CPU Count: + 224
      • LLC size: + 110.1 MB
    • GPU: +
      • GPU 0: 0:41:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:41:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
      • GPU 1: 0:58:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:58:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
      • GPU 2: 0:154:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:154:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
      • GPU 3: 0:202:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:202:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
+ diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/reports/output_offload_linear.html b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/reports/output_offload_linear.html new file mode 100644 index 0000000000..2736b84fbb --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/reports/output_offload_linear.html @@ -0,0 +1,142 @@ + +
Intel® VTune Profiler 2024.0.0

Recommendations:

GPU Time, % of Elapsed time: 29.6%
GPU utilization is low. Switch to the for in-depth analysis of host activity. Poor GPU utilization can prevent the application from offloading effectively.
XVE Array Stalled/Idle: 56.6% of Elapsed time with GPU busy
GPU metrics detect some kernel issues. Use GPU Compute/Media Hotspots (preview) to understand how well your application runs on the specified hardware.
  • Elapsed Time: + 13.283s
    • GPU Time, % of Elapsed time: + 29.6%
      GPU utilization is low. Consider offloading more work to the GPU to increase overall application performance.
      • GPU Time, % of Elapsed time: +
        GPU AdapterGPU EngineGPU TimeGPU Time, % of Elapsed time(%)
        GPU 1Render and GPGPU3.927s29.6%
      • Top Hotspots when GPU was idle: +
        FunctionModuleCPU Time
        asm_exc_page_faultvmlinux4.544s
        [Skipped stack frame(s)][Unknown]1.836s
        _raw_spin_lockvmlinux1.600s
        asm_exc_int3vmlinux1.400s
        memcmplibc-dynamic.so1.300s
        [Others]N/A18.163s
  • Hottest Host Tasks: +
    Host TaskTask Time% of Elapsed Time(%)Task Count
    zeEventHostSynchronize3.942s29.7%102
    zeCommandListAppendMemoryCopy0.374s2.8%5
    zeModuleCreate0.088s0.7%1
    zeCommandListAppendLaunchKernel0.001s0.0%100
    zeCommandListCreateImmediate0.001s0.0%3
    [Others]0.000s0.0%5
  • Hottest GPU Computing Tasks: +
    Computing TaskTotal TimeExecution Time% of Total Time(%)SIMD Width
    iso3dfd(sycl::_V1::queue&, float*, float*, float*, float*, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long)::{lambda()#1}::operator()<sycl::_V1::handler>(, signed char) const::{lambda()#1}3.923s3.921s100.0%32
    [Outside any task]0.752s0s0.0%
  • Collection and Platform Info: +
    • Application Command Line: + ./build/src/4_GPU_optimized "1024" "1024" "1024" "32" "8" "4" "100"
    • Operating System: + 5.15.0-100-generic DISTRIB_ID=Ubuntu +DISTRIB_RELEASE=22.04 +DISTRIB_CODENAME=jammy +DISTRIB_DESCRIPTION="Ubuntu 22.04.4 LTS"
    • Computer Name: + idc-beta-batch-pvc-node-04
    • Result Size: + 130.0 MB
    • Collection start time: + 16:12:35 21/03/2024 UTC
    • Collection stop time: + 16:12:48 21/03/2024 UTC
    • Collector Type: + Event-based sampling driver,Driverless Perf system-wide sampling,User-mode sampling and tracing
    • CPU: +
      • Name: + Intel(R) Xeon(R) Processor code named Sapphirerapids
      • Frequency: + 2.000 GHz
      • Logical CPU Count: + 224
      • LLC size: + 110.1 MB
    • GPU: +
      • GPU 0: 0:41:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:41:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
      • GPU 1: 0:58:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:58:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
      • GPU 2: 0:154:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:154:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
      • GPU 3: 0:202:0.0 : Display controller: Intel Corporation Device 0x0bda: +
        • BDF: + 0:202:0:0
        • XVE Count: + 448
        • Max XVE Thread Count: + 8
        • Max Core Frequency: + 1.550 GHz
+ diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/run_gpu_linear_roofline.sh b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/run_gpu_linear_roofline.sh new file mode 100644 index 0000000000..11339a8e0f --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/run_gpu_linear_roofline.sh @@ -0,0 +1,10 @@ +#!/bin/bash +#advisor --collect=roofline --profile-gpu --project-dir=./../advisor/3_gpu -- ./build/src/3_GPU_linear 256 256 256 100 + +advisor --collect=survey --profile-gpu -project-dir=./roofline_linear -- ./build/src/4_GPU_optimized 1024 1024 1024 32 8 4 100 +advisor --collect=tripcounts --profile-gpu --project-dir=./roofline_linear -- ./build/src/4_GPU_optimized 1024 1024 1024 32 8 4 100 +advisor --collect=projection --profile-gpu --model-baseline-gpu --project-dir=./roofline_linear +advisor --report=roofline --gpu --project-dir=roofline_linear --report-output=./roofline_linear/roofline_linear.html + + + diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/run_gpu_linear_vtune.sh b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/run_gpu_linear_vtune.sh new file mode 100644 index 0000000000..da5c802e61 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/run_gpu_linear_vtune.sh @@ -0,0 +1,6 @@ +#!/bin/bash +vtune -run-pass-thru=--no-altstack -collect=gpu-offload -result-dir=vtune_dir_linear -- ./build/src/4_GPU_optimized 1024 1024 1024 32 8 4 100 +vtune -run-pass-thru=--no-altstack -collect=gpu-hotspots -result-dir=vtune_dir_hotspots_linear -- ./build/src/4_GPU_optimized 1024 1024 1024 32 8 4 100 +vtune -report summary -result-dir vtune_dir_linear -format html -report-output ./reports/output_offload_linear.html +vtune -report summary -result-dir vtune_dir_hotspots_linear -format html -report-output 
./reports/output_hotspots_linear.html + diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/run_gpu_optimized.sh b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/run_gpu_optimized.sh new file mode 100644 index 0000000000..95f894e7e2 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/run_gpu_optimized.sh @@ -0,0 +1,8 @@ +#!/bin/bash + +rm -rf build +build="$PWD/build" +[ ! -d "$build" ] && mkdir -p "$build" +cd build && +cmake .. && +make run_gpu_optimized diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/run_gpu_roofline_advisor.sh b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/run_gpu_roofline_advisor.sh new file mode 100644 index 0000000000..745b4f2eb8 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/run_gpu_roofline_advisor.sh @@ -0,0 +1,4 @@ +#!/bin/bash +advisor --collect=roofline --profile-gpu --project-dir=./../advisor/4_gpu/b8816 -- ./build/src/4_GPU_optimized 1024 1024 1024 8 8 16 100 + + diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/src/1_CPU_only.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/src/1_CPU_only.cpp new file mode 100644 index 0000000000..97730a9aec --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/src/1_CPU_only.cpp @@ -0,0 +1,129 @@ +//============================================================== +// Copyright 2022 Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= + +#include +#include +#include + +#include "Utils.hpp" + +void inline iso3dfdIteration(float* ptr_next_base, float* ptr_prev_base, + float* ptr_vel_base, float* coeff, const size_t n1, + const size_t n2, const size_t n3) 
{ + auto dimn1n2 = n1 * n2; + + // Remove HALO from the end + auto n3_end = n3 - kHalfLength; + auto n2_end = n2 - kHalfLength; + auto n1_end = n1 - kHalfLength; + + for (auto iz = kHalfLength; iz < n3_end; iz++) { + for (auto iy = kHalfLength; iy < n2_end; iy++) { + // Calculate start pointers for the row over X dimension + float* ptr_next = ptr_next_base + iz * dimn1n2 + iy * n1; + float* ptr_prev = ptr_prev_base + iz * dimn1n2 + iy * n1; + float* ptr_vel = ptr_vel_base + iz * dimn1n2 + iy * n1; + + // Iterate over X + for (auto ix = kHalfLength; ix < n1_end; ix++) { + // Calculate values for each cell + float value = ptr_prev[ix] * coeff[0]; + for (int i = 1; i <= kHalfLength; i++) { + value += + coeff[i] * + (ptr_prev[ix + i] + ptr_prev[ix - i] + + ptr_prev[ix + i * n1] + ptr_prev[ix - i * n1] + + ptr_prev[ix + i * dimn1n2] + ptr_prev[ix - i * dimn1n2]); + } + ptr_next[ix] = 2.0f * ptr_prev[ix] - ptr_next[ix] + value * ptr_vel[ix]; + } + } + } +} + +void iso3dfd(float* next, float* prev, float* vel, float* coeff, + const size_t n1, const size_t n2, const size_t n3, + const size_t nreps) { + for (auto it = 0; it < nreps; it++) { + iso3dfdIteration(next, prev, vel, coeff, n1, n2, n3); + // Swap the pointers for always having current values in prev array + std::swap(next, prev); + } +} + +int main(int argc, char* argv[]) { + // Arrays used to update the wavefield + float* prev; + float* next; + // Array to store wave velocity + float* vel; + + // Variables to store size of grids and number of simulation iterations + size_t n1, n2, n3; + size_t num_iterations; + + if (argc < 5) { + Usage(argv[0]); + return 1; + } + + try { + // Parse command line arguments and increase them by HALO + n1 = std::stoi(argv[1]) + (2 * kHalfLength); + n2 = std::stoi(argv[2]) + (2 * kHalfLength); + n3 = std::stoi(argv[3]) + (2 * kHalfLength); + num_iterations = std::stoi(argv[4]); + } catch (...) 
{ + Usage(argv[0]); + return 1; + } + + // Validate input sizes for the grid + if (ValidateInput(n1, n2, n3, num_iterations)) { + Usage(argv[0]); + return 1; + } + + // Compute the total size of grid + size_t nsize = n1 * n2 * n3; + + prev = new float[nsize]; + next = new float[nsize]; + vel = new float[nsize]; + + // Compute coefficients to be used in wavefield update + float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1, + +7.572087e-2, -1.76767677e-2, +3.480962e-3, + -5.180005e-4, +5.074287e-5, -2.42812e-6}; + + // Apply the DX, DY and DZ to coefficients + coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz); + for (auto i = 1; i <= kHalfLength; i++) { + coeff[i] = coeff[i] / (dxyz * dxyz); + } + + // Initialize arrays and introduce initial conditions (source) + initialize(prev, next, vel, n1, n2, n3); + + std::cout << "Running on CPU serial version\n"; + auto start = std::chrono::steady_clock::now(); + + // Invoke the driver function to perform 3D wave propagation 1 thread serial + // version + iso3dfd(next, prev, vel, coeff, n1, n2, n3, num_iterations); + + auto end = std::chrono::steady_clock::now(); + auto time = std::chrono::duration_cast(end - start) + .count(); + + printStats(time, n1, n2, n3, num_iterations); + + delete[] prev; + delete[] next; + delete[] vel; + + return 0; +} \ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/src/2_GPU_basic.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/src/2_GPU_basic.cpp new file mode 100644 index 0000000000..3571f98bfc --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/src/2_GPU_basic.cpp @@ -0,0 +1,153 @@ +//============================================================== +// Copyright 2022 Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= + +#include +#include +#include 
+#include + +#include "Utils.hpp" + +using namespace sycl; + +void iso3dfd(queue& q, float* next, float* prev, float* vel, float* coeff, + const size_t n1, const size_t n2, const size_t n3, + const size_t nreps) { + // Create 3D SYCL range for kernels which not include HALO + range<3> kernel_range(n1 - 2 * kHalfLength, n2 - 2 * kHalfLength, + n3 - 2 * kHalfLength); + // Create 3D SYCL range for buffers which include HALO + range<3> buffer_range(n1, n2, n3); + // Create buffers using SYCL class buffer + buffer next_buf(next, buffer_range); + buffer prev_buf(prev, buffer_range); + buffer vel_buf(vel, buffer_range); + buffer coeff_buf(coeff, range(kHalfLength + 1)); + + for (auto it = 0; it < nreps; it += 1) { + // Submit command group for execution + q.submit([&](handler& h) { + // Create accessors + accessor next_acc(next_buf, h); + accessor prev_acc(prev_buf, h); + accessor vel_acc(vel_buf, h, read_only); + accessor coeff_acc(coeff_buf, h, read_only); + + // Send a SYCL kernel(lambda) to the device for parallel execution + // Each kernel runs single cell + h.parallel_for(kernel_range, [=](id<3> idx) { + // Start of device code + // Add offsets to indices to exclude HALO + int i = idx[0] + kHalfLength; + int j = idx[1] + kHalfLength; + int k = idx[2] + kHalfLength; + + // Calculate values for each cell + float value = prev_acc[i][j][k] * coeff_acc[0]; +#pragma unroll(8) + for (int x = 1; x <= kHalfLength; x++) { + value += + coeff_acc[x] * (prev_acc[i][j][k + x] + prev_acc[i][j][k - x] + + prev_acc[i][j + x][k] + prev_acc[i][j - x][k] + + prev_acc[i + x][j][k] + prev_acc[i - x][j][k]); + } + next_acc[i][j][k] = 2.0f * prev_acc[i][j][k] - next_acc[i][j][k] + + value * vel_acc[i][j][k]; + // End of device code + }); + }); + + // Swap the buffers for always having current values in prev buffer + std::swap(next_buf, prev_buf); + } +} + +int main(int argc, char* argv[]) { + // Arrays used to update the wavefield + float* prev; + float* next; + // Array to store wave 
velocity + float* vel; + + // Variables to store size of grids and number of simulation iterations + size_t n1, n2, n3; + size_t num_iterations; + + // Flag to verify results with CPU version + bool verify = false; + + if (argc < 5) { + Usage(argv[0]); + return 1; + } + + try { + // Parse command line arguments and increase them by HALO + n1 = std::stoi(argv[1]) + (2 * kHalfLength); + n2 = std::stoi(argv[2]) + (2 * kHalfLength); + n3 = std::stoi(argv[3]) + (2 * kHalfLength); + num_iterations = std::stoi(argv[4]); + if (argc > 5) verify = true; + } catch (...) { + Usage(argv[0]); + return 1; + } + + // Validate input sizes for the grid + if (ValidateInput(n1, n2, n3, num_iterations)) { + Usage(argv[0]); + return 1; + } + + // Create queue and print target info with default selector and in order + // property + queue q(default_selector_v, {property::queue::in_order()}); + std::cout << " Running GPU basic offload version\n"; + printTargetInfo(q); + + // Compute the total size of grid + size_t nsize = n1 * n2 * n3; + + prev = new float[nsize]; + next = new float[nsize]; + vel = new float[nsize]; + + // Compute coefficients to be used in wavefield update + float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1, + +7.572087e-2, -1.76767677e-2, +3.480962e-3, + -5.180005e-4, +5.074287e-5, -2.42812e-6}; + + // Apply the DX, DY and DZ to coefficients + coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz); + for (auto i = 1; i <= kHalfLength; i++) { + coeff[i] = coeff[i] / (dxyz * dxyz); + } + + // Initialize arrays and introduce initial conditions (source) + initialize(prev, next, vel, n1, n2, n3); + + auto start = std::chrono::steady_clock::now(); + + // Invoke the driver function to perform 3D wave propagation offloaded to + // the device + iso3dfd(q, next, prev, vel, coeff, n1, n2, n3, num_iterations); + + auto end = std::chrono::steady_clock::now(); + auto time = std::chrono::duration_cast(end - start) + .count(); + printStats(time, n1, n2, n3, num_iterations); + 
+ // Verify result with the CPU serial version + if (verify) { + VerifyResult(prev, next, vel, coeff, n1, n2, n3, num_iterations); + } + + delete[] prev; + delete[] next; + delete[] vel; + + return 0; +} \ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/src/3_GPU_linear.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/src/3_GPU_linear.cpp new file mode 100644 index 0000000000..ad780e226d --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/src/3_GPU_linear.cpp @@ -0,0 +1,174 @@ +//============================================================== +// Copyright © Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= +//============================================================== +// Copyright © Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= +#include +#include +#include +#include + +#include "Utils.hpp" + +using namespace sycl; + +void iso3dfd(queue& q, float* ptr_next, float* ptr_prev, float* ptr_vel, float* ptr_coeff, + const size_t n1, const size_t n2, const size_t n3,size_t n1_block, size_t n2_block, size_t n3_block, + const size_t nIterations) { + auto nx = n1; + auto nxy = n1*n2; + auto grid_size = nxy*n3; + + auto b1 = kHalfLength; + auto b2 = kHalfLength; + auto b3 = kHalfLength; + + auto next = sycl::aligned_alloc_device(64, grid_size + 16, q); + next += (16 - b1); + q.memcpy(next, ptr_next, sizeof(float)*grid_size); + auto prev = sycl::aligned_alloc_device(64, grid_size + 16, q); + prev += (16 - b1); + q.memcpy(prev, ptr_prev, sizeof(float)*grid_size); + auto vel = sycl::aligned_alloc_device(64, grid_size + 16, q); + vel += (16 - b1); + q.memcpy(vel, ptr_vel, sizeof(float)*grid_size); + auto coeff = sycl::aligned_alloc_device(64, grid_size + 16, q); + //coeff += 
(16 - b1);
+  // Copy only the (kHalfLength + 1) host coefficients to the device buffer
+  q.memcpy(coeff, ptr_coeff, sizeof(float) * (kHalfLength + 1));
+  q.wait();
+
+  //auto local_nd_range = range(1, n2_block, n1_block);
+  //auto global_nd_range = range((n3 - 2 * kHalfLength)/n3_block, (n2 - 2 * kHalfLength)/n2_block,
+  //(n1 - 2 * kHalfLength));
+
+  auto local_nd_range = range<3>(n3_block, n2_block, n1_block);
+  auto global_nd_range = range<3>((n3 - 2*b3 + n3_block - 1)/n3_block*n3_block, (n2 - 2*b2 + n2_block - 1)/n2_block*n2_block, n1_block);
+
+  for (auto i = 0; i < nIterations; i += 1) {
+    q.submit([&](auto &h) {
+      h.parallel_for(
+          nd_range(global_nd_range, local_nd_range), [=](auto item)
+          //[[intel::reqd_sub_group_size(32)]]
+          //[[intel::kernel_args_restrict]]
+          {
+            const int iz = b3 + item.get_global_id(0);
+            const int iy = b2 + item.get_global_id(1);
+            if (iz < n3 - b3 && iy < n2 - b2)
+              for (int ix = b1 + item.get_global_id(2); ix < n1 - b1; ix += n1_block)
+              {
+                auto gid = ix + iy*nx + iz*nxy;
+                float *pgid = prev + gid;
+                auto value = coeff[0] * pgid[0];
+#pragma unroll(kHalfLength)
+                for (auto iter = 1; iter <= kHalfLength; iter++)
+                  value += coeff[iter]*(pgid[iter*nxy] + pgid[-iter*nxy] + pgid[iter*nx] + pgid[-iter*nx] + pgid[iter] + pgid[-iter]);
+                next[gid] = 2.0f*pgid[0] - next[gid] + value*vel[gid];
+              }
+          });
+    }).wait();
+    std::swap(next, prev);
+  }
+  q.memcpy(ptr_prev, prev, sizeof(float)*grid_size);
+
+  sycl::free(next - (16 - b1), q);
+  sycl::free(prev - (16 - b1), q);
+  sycl::free(vel - (16 - b1), q);
+  sycl::free(coeff, q);
+}
+
+int main(int argc, char* argv[]) {
+  // Arrays used to update the wavefield
+  float* prev;
+  float* next;
+  // Array to store wave velocity
+  float* vel;
+
+  // Variables to store size of grids and number of simulation iterations
+  size_t n1, n2, n3;
+  size_t n1_block, n2_block, n3_block;
+  size_t num_iterations;
+
+  // Flag to verify results with CPU version
+  bool verify = false;
+
+  // This version requires seven positional arguments
+  if (argc < 8) {
+    Usage(argv[0]);
+    return 1;
+  }
+
+  try {
+    // 
Parse command line arguments and increase them by HALO
+ n1 = std::stoi(argv[1]) + (2 * kHalfLength);
+ n2 = std::stoi(argv[2]) + (2 * kHalfLength);
+ n3 = std::stoi(argv[3]) + (2 * kHalfLength);
+ n1_block = std::stoi(argv[4]);
+ n2_block = std::stoi(argv[5]);
+ n3_block = std::stoi(argv[6]);
+ num_iterations = std::stoi(argv[7]);
+ } catch (...) {
+ Usage(argv[0]);
+ return 1;
+ }
+
+ // Validate input sizes for the grid
+ if (ValidateInput(n1, n2, n3, num_iterations)) {
+ Usage(argv[0]);
+ return 1;
+ }
+
+ // Create queue and print target info with default selector and in order
+ // property
+ queue q(default_selector_v, {property::queue::in_order()});
+ std::cout << " Running linear indexed GPU version\n";
+ printTargetInfo(q);
+
+ // Compute the total size of grid
+ size_t nsize = n1 * n2 * n3;
+
+ prev = new float[nsize];
+ next = new float[nsize];
+ vel = new float[nsize];
+
+ // Compute coefficients to be used in wavefield update
+ float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1,
+ +7.572087e-2, -1.76767677e-2, +3.480962e-3,
+ -5.180005e-4, +5.074287e-5, -2.42812e-6};
+
+ // Apply the DX, DY and DZ to coefficients
+ coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz);
+ for (auto i = 1; i <= kHalfLength; i++) {
+ coeff[i] = coeff[i] / (dxyz * dxyz);
+ }
+
+ // Initialize arrays and introduce initial conditions (source)
+ initialize(prev, next, vel, n1, n2, n3);
+
+ auto start = std::chrono::steady_clock::now();
+
+ // Invoke the driver function to perform 3D wave propagation offloaded to
+ // the device
+ iso3dfd(q, next, prev, vel, coeff, n1, n2, n3, n1_block, n2_block, n3_block, num_iterations);
+
+ auto end = std::chrono::steady_clock::now();
+ auto time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
+ .count();
+ printStats(time, n1, n2, n3, num_iterations);
+
+ // Verify result with the CPU serial version
+ if (verify) {
+ VerifyResult(prev, next, vel, coeff, n1, n2, n3, num_iterations);
+ }
+
+ delete[] prev;
+ delete[] next;
+ delete[] vel;
+
+ return 0;
+}
diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/src/4_GPU_optimized.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/src/4_GPU_optimized.cpp
new file mode 100644
index 0000000000..ef1664d09c
--- /dev/null
+++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/src/4_GPU_optimized.cpp
@@ -0,0 +1,155 @@
+//==============================================================
+// Copyright © Intel Corporation
+//
+// SPDX-License-Identifier: MIT
+// =============================================================
+
+#include <sycl/sycl.hpp>
+#include <chrono>
+#include <string>
+#include <fstream>
+
+#include "Utils.hpp"
+
+using namespace sycl;
+
+void iso3dfd(queue& q, float* ptr_next, float* ptr_prev, float* ptr_vel, float* ptr_coeff,
+ const size_t n1, const size_t n2, const size_t n3, size_t n1_block, size_t n2_block, size_t n3_block,
+ const size_t nIterations) {
+ auto nx = n1;
+ auto nxy = n1*n2;
+ auto grid_size = nxy*n3;
+
+ auto next = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);
+ next += (16 - kHalfLength);
+ q.memcpy(next, ptr_next, sizeof(float)*grid_size);
+ auto prev = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);
+ prev += (16 - kHalfLength);
+ q.memcpy(prev, ptr_prev, sizeof(float)*grid_size);
+ auto vel = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);
+ vel += (16 - kHalfLength);
+ q.memcpy(vel, ptr_vel, sizeof(float)*grid_size);
+ auto coeff = sycl::aligned_alloc_device<float>(64, kHalfLength + 1, q);
+ q.memcpy(coeff, ptr_coeff, sizeof(float)*(kHalfLength + 1));
+ q.wait();
+
+ auto local_nd_range = range<3>(n3_block, n2_block, n1_block);
+ auto global_nd_range = range<3>((n3-2*kHalfLength+n3_block-1)/n3_block*n3_block, (n2-2*kHalfLength+n2_block-1)/n2_block*n2_block, n1_block);
+
+ for (auto i = 0; i < nIterations; i += 1) {
+ q.submit([&](auto &h) {
+ h.parallel_for(
+ nd_range(global_nd_range,
local_nd_range), [=](auto item)
+ {
+ const int iz = kHalfLength + item.get_global_id(0);
+ const int iy = kHalfLength + item.get_global_id(1);
+ if (iz < n3 - kHalfLength && iy < n2 - kHalfLength)
+ for (int ix = kHalfLength + item.get_global_id(2); ix < n1 - kHalfLength; ix += n1_block)
+ {
+ auto gid = ix + iy*nx + iz*nxy;
+ float *pgid = prev + gid;
+ auto value = coeff[0] * pgid[0];
+#pragma unroll(kHalfLength)
+ for (auto iter = 1; iter <= kHalfLength; iter++)
+ value += coeff[iter]*(pgid[iter*nxy] + pgid[-iter*nxy] + pgid[iter*nx] + pgid[-iter*nx] + pgid[iter] + pgid[-iter]);
+ next[gid] = 2.0f*pgid[0] - next[gid] + value*vel[gid];
+ }
+ });
+ }).wait();
+ std::swap(next, prev);
+ }
+ q.memcpy(ptr_prev, prev, sizeof(float)*grid_size).wait();
+
+ sycl::free(next - (16 - kHalfLength), q);
+ sycl::free(prev - (16 - kHalfLength), q);
+ sycl::free(vel - (16 - kHalfLength), q);
+ sycl::free(coeff, q);
+
+}
+
+int main(int argc, char* argv[]) {
+ // Arrays used to update the wavefield
+ float* prev;
+ float* next;
+ // Array to store wave velocity
+ float* vel;
+
+ // Variables to store size of grids and number of simulation iterations
+ size_t n1, n2, n3;
+ size_t n1_block, n2_block, n3_block;
+ size_t num_iterations;
+
+ // Flag to verify results with CPU version
+ bool verify = false;
+
+ // Seven arguments are required: n1 n2 n3 n1_block n2_block n3_block Iterations
+ if (argc < 8) {
+ Usage(argv[0]);
+ return 1;
+ }
+
+ try {
+ // Parse command line arguments and increase them by HALO
+ n1 = std::stoi(argv[1]) + (2 * kHalfLength);
+ n2 = std::stoi(argv[2]) + (2 * kHalfLength);
+ n3 = std::stoi(argv[3]) + (2 * kHalfLength);
+ n1_block = std::stoi(argv[4]);
+ n2_block = std::stoi(argv[5]);
+ n3_block = std::stoi(argv[6]);
+ num_iterations = std::stoi(argv[7]);
+ } catch (...)
{
+ Usage(argv[0]);
+ return 1;
+ }
+
+ // Validate input sizes for the grid
+ if (ValidateInput(n1, n2, n3, num_iterations)) {
+ Usage(argv[0]);
+ return 1;
+ }
+
+ // Create queue and print target info with default selector and in order
+ // property
+ queue q(default_selector_v, {property::queue::in_order()});
+ std::cout << " Running nd_range GPU version\n";
+ printTargetInfo(q);
+
+ // Compute the total size of grid
+ size_t nsize = n1 * n2 * n3;
+
+ prev = new float[nsize];
+ next = new float[nsize];
+ vel = new float[nsize];
+
+ // Compute coefficients to be used in wavefield update
+ float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1,
+ +7.572087e-2, -1.76767677e-2, +3.480962e-3,
+ -5.180005e-4, +5.074287e-5, -2.42812e-6};
+
+ // Apply the DX, DY and DZ to coefficients
+ coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz);
+ for (auto i = 1; i <= kHalfLength; i++) {
+ coeff[i] = coeff[i] / (dxyz * dxyz);
+ }
+
+ // Initialize arrays and introduce initial conditions (source)
+ initialize(prev, next, vel, n1, n2, n3);
+
+ auto start = std::chrono::steady_clock::now();
+
+ // Invoke the driver function to perform 3D wave propagation offloaded to
+ // the device
+ iso3dfd(q, next, prev, vel, coeff, n1, n2, n3, n1_block, n2_block, n3_block, num_iterations);
+
+ auto end = std::chrono::steady_clock::now();
+ auto time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
+ .count();
+ printStats(time, n1, n2, n3, num_iterations);
+
+ delete[] prev;
+ delete[] next;
+ delete[] vel;
+
+ return 0;
+}
diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/src/CMakeLists.txt b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/src/CMakeLists.txt
new file mode 100644
index 0000000000..eeefc255e8
--- /dev/null
+++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/src/CMakeLists.txt
@@ -0,0 +1,29 @@
+set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -fsycl --std=c++17")
+# Set
default build type to RelWithDebInfo if not specified
+if (NOT CMAKE_BUILD_TYPE)
+ message (STATUS "Default CMAKE_BUILD_TYPE not set, using Release with Debug Info")
+ set (CMAKE_BUILD_TYPE "RelWithDebInfo" CACHE
+ STRING "Choose the type of build, options are: None Debug Release RelWithDebInfo MinSizeRel"
+ FORCE)
+endif()
+
+set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS}")
+
+add_executable(1_CPU_only 1_CPU_only.cpp)
+add_executable(2_GPU_basic 2_GPU_basic.cpp)
+add_executable(3_GPU_linear 3_GPU_linear.cpp)
+add_executable(4_GPU_optimized 4_GPU_optimized.cpp)
+
+target_link_libraries(1_CPU_only OpenCL sycl)
+target_link_libraries(2_GPU_basic OpenCL sycl)
+target_link_libraries(3_GPU_linear OpenCL sycl)
+target_link_libraries(4_GPU_optimized OpenCL sycl)
+
+# 3_GPU_linear and 4_GPU_optimized also expect the three work-group sizes
+add_custom_target(run_all 1_CPU_only 1024 1024 1024 100
+ COMMAND 2_GPU_basic 1024 1024 1024 100
+ COMMAND 3_GPU_linear 1024 1024 1024 32 4 8 100
+ COMMAND 4_GPU_optimized 1024 1024 1024 32 4 8 100)
+add_custom_target(run_cpu 1_CPU_only 1024 1024 1024 100)
+add_custom_target(run_gpu_basic 2_GPU_basic 1024 1024 1024 100)
+add_custom_target(run_gpu_linear 3_GPU_linear 1024 1024 1024 32 4 8 100)
+add_custom_target(run_gpu_optimized 4_GPU_optimized 1024 1024 1024 32 4 8 100)
\ No newline at end of file
diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/src/Iso3dfd.hpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/src/Iso3dfd.hpp
new file mode 100644
index 0000000000..e3487fa0cf
--- /dev/null
+++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/src/Iso3dfd.hpp
@@ -0,0 +1,21 @@
+//==============================================================
+// Copyright © 2022 Intel Corporation
+//
+// SPDX-License-Identifier: MIT
+// =============================================================
+
+#pragma once
+
+constexpr size_t kHalfLength = 8;
+constexpr float dxyz = 50.0f;
+constexpr float dt = 0.002f;
+
+#define
STENCIL_LOOKUP(ir) \
+ (coeff[ir] * ((ptr_prev[ix + ir] + ptr_prev[ix - ir]) + \
+ (ptr_prev[ix + ir * n1] + ptr_prev[ix - ir * n1]) + \
+ (ptr_prev[ix + ir * dimn1n2] + ptr_prev[ix - ir * dimn1n2])))
+
+
+#define KERNEL_STENCIL_LOOKUP(x) \
+ coeff[x] * (tab[l_idx + x] + tab[l_idx - x] + front[x] + back[x - 1] \
+ + tab[l_idx + l_n3 * x] + tab[l_idx - l_n3 * x])
\ No newline at end of file
diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/src/Utils.hpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/src/Utils.hpp
new file mode 100644
index 0000000000..98d4a6e12c
--- /dev/null
+++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/04_ISO3DFD_GPU_Optimized/src/Utils.hpp
@@ -0,0 +1,259 @@
+//==============================================================
+// Copyright © 2022 Intel Corporation
+//
+// SPDX-License-Identifier: MIT
+// =============================================================
+
+#pragma once
+
+#include <fstream>
+#include <iostream>
+
+#include "Iso3dfd.hpp"
+
+void Usage(const std::string& programName, bool usedNd_ranges = false) {
+ std::cout << "--------------------------------------\n";
+ std::cout << " Incorrect parameters \n";
+ std::cout << " Usage: ";
+ std::cout << programName << " n1 n2 n3 Iterations";
+
+ if (usedNd_ranges) std::cout << " kernel_iterations n2_WGS n3_WGS";
+
+ std::cout << " [verify]\n\n";
+ std::cout << " n1 n2 n3 : Grid sizes for the stencil\n";
+ std::cout << " Iterations : No. of timesteps.\n";
+
+ if (usedNd_ranges) {
+ std::cout
+ << " kernel_iterations : No.
of cells calculated by one kernel\n"; + std::cout << " n2_WGS n3_WGS : n2 and n3 work group sizes\n"; + } + std::cout + << " [verify] : Optional: Compare results with CPU version\n"; + std::cout << "--------------------------------------\n"; + std::cout << "--------------------------------------\n"; +} + +bool ValidateInput(size_t n1, size_t n2, size_t n3, size_t num_iterations, + size_t kernel_iterations = -1, size_t n2_WGS = kHalfLength, + size_t n3_WGS = kHalfLength) { + if ((n1 < kHalfLength) || (n2 < kHalfLength) || (n3 < kHalfLength) || + (n2_WGS < kHalfLength) || (n3_WGS < kHalfLength)) { + std::cout << "--------------------------------------\n"; + std::cout << " Invalid grid size : n1, n2, n3, n2_WGS, n3_WGS should be " + "greater than " + << kHalfLength << "\n"; + return true; + } + + if ((n2 < n2_WGS) || (n3 < n3_WGS)) { + std::cout << "--------------------------------------\n"; + std::cout << " Invalid work group size : n2 should be greater than n2_WGS " + "and n3 greater than n3_WGS\n"; + return true; + } + + if (((n2 - 2 * kHalfLength) % n2_WGS) && kernel_iterations != -1) { + std::cout << "--------------------------------------\n"; + std::cout << " ERROR: Invalid Grid Size: n2 should be multiple of n2_WGS - " + << n2_WGS << "\n"; + return true; + } + if (((n3 - 2 * kHalfLength) % n3_WGS) && kernel_iterations != -1) { + std::cout << "--------------------------------------\n"; + std::cout << " ERROR: Invalid Grid Size: n3 should be multiple of n3_WGS - " + << n3_WGS << "\n"; + return true; + } + if (((n1 - 2 * kHalfLength) % kernel_iterations) && kernel_iterations != -1) { + std::cout << "--------------------------------------\n"; + std::cout << " ERROR: Invalid Grid Size: n1 should be multiple of " + "kernel_iterations - " + << kernel_iterations << "\n"; + return true; + } + + return false; +} + +bool CheckWorkGroupSize(sycl::queue& q, unsigned int n2_WGS, + unsigned int n3_WGS) { + auto device = q.get_device(); + auto max_block_size = + 
device.get_info<sycl::info::device::max_work_group_size>();
+
+ if ((max_block_size > 1) && (n2_WGS * n3_WGS > max_block_size)) {
+ std::cout << "ERROR: Invalid block sizes: n2_WGS * n3_WGS should be "
+ "less than or equal to "
+ << max_block_size << "\n";
+ return true;
+ }
+
+ return false;
+}
+
+void printTargetInfo(sycl::queue& q) {
+ auto device = q.get_device();
+ auto max_block_size =
+ device.get_info<sycl::info::device::max_work_group_size>();
+
+ auto max_exec_unit_count =
+ device.get_info<sycl::info::device::max_compute_units>();
+
+ std::cout << " Running on " << device.get_info<sycl::info::device::name>()
+ << "\n";
+ std::cout << " The Device Max Work Group Size is : " << max_block_size
+ << "\n";
+ std::cout << " The Device Max EUCount is : " << max_exec_unit_count << "\n";
+}
+
+void initialize(float* ptr_prev, float* ptr_next, float* ptr_vel, size_t n1,
+ size_t n2, size_t n3) {
+ auto dim2 = n2 * n1;
+
+ for (auto i = 0; i < n3; i++) {
+ for (auto j = 0; j < n2; j++) {
+ auto offset = i * dim2 + j * n1;
+
+ for (auto k = 0; k < n1; k++) {
+ ptr_prev[offset + k] = 0.0f;
+ ptr_next[offset + k] = 0.0f;
+ ptr_vel[offset + k] =
+ 2250000.0f * dt * dt; // Integration of the v*v and dt*dt here
+ }
+ }
+ }
+ // Then we add a source
+ float val = 1.f;
+ for (auto s = 5; s >= 0; s--) {
+ for (auto i = n3 / 2 - s; i < n3 / 2 + s; i++) {
+ for (auto j = n2 / 4 - s; j < n2 / 4 + s; j++) {
+ auto offset = i * dim2 + j * n1;
+ for (auto k = n1 / 4 - s; k < n1 / 4 + s; k++) {
+ ptr_prev[offset + k] = val;
+ }
+ }
+ }
+ val *= 10;
+ }
+}
+
+void printStats(double time, size_t n1, size_t n2, size_t n3,
+ size_t num_iterations) {
+ float throughput_mpoints = 0.0f, mflops = 0.0f, normalized_time = 0.0f;
+ double mbytes = 0.0f;
+
+ normalized_time = (double)time / num_iterations;
+ throughput_mpoints = ((n1 - 2 * kHalfLength) * (n2 - 2 * kHalfLength) *
+ (n3 - 2 * kHalfLength)) /
+ (normalized_time * 1e3f);
+ mflops = (7.0f * kHalfLength + 5.0f) * throughput_mpoints;
+ mbytes = 12.0f * throughput_mpoints;
+
+ std::cout << "--------------------------------------\n";
+ std::cout << "time : " << time / 1e3f << " 
secs\n"; + std::cout << "throughput : " << throughput_mpoints << " Mpts/s\n"; + std::cout << "flops : " << mflops / 1e3f << " GFlops\n"; + std::cout << "bytes : " << mbytes / 1e3f << " GBytes/s\n"; + std::cout << "\n--------------------------------------\n"; + std::cout << "\n--------------------------------------\n"; +} + +bool WithinEpsilon(float* output, float* reference, const size_t dim_x, + const size_t dim_y, const size_t dim_z, + const unsigned int radius, const int zadjust = 0, + const float delta = 0.01f) { + std::ofstream error_file; + error_file.open("error_diff.txt"); + + bool error = false; + double norm2 = 0; + + for (size_t iz = 0; iz < dim_z; iz++) { + for (size_t iy = 0; iy < dim_y; iy++) { + for (size_t ix = 0; ix < dim_x; ix++) { + if (ix >= radius && ix < (dim_x - radius) && iy >= radius && + iy < (dim_y - radius) && iz >= radius && + iz < (dim_z - radius + zadjust)) { + float difference = fabsf(*reference - *output); + norm2 += difference * difference; + if (difference > delta) { + error = true; + error_file << " ERROR: " << ix << ", " << iy << ", " << iz << " " + << *output << " instead of " << *reference + << " (|e|=" << difference << ")\n"; + } + } + ++output; + ++reference; + } + } + } + + error_file.close(); + norm2 = sqrt(norm2); + if (error) std::cout << "error (Euclidean norm): " << norm2 << "\n"; + return error; +} + +void inline iso3dfdCPUIteration(float* ptr_next_base, float* ptr_prev_base, + float* ptr_vel_base, float* coeff, + const size_t n1, const size_t n2, + const size_t n3) { + auto dimn1n2 = n1 * n2; + + auto n3_end = n3 - kHalfLength; + auto n2_end = n2 - kHalfLength; + auto n1_end = n1 - kHalfLength; + + for (auto iz = kHalfLength; iz < n3_end; iz++) { + for (auto iy = kHalfLength; iy < n2_end; iy++) { + float* ptr_next = ptr_next_base + iz * dimn1n2 + iy * n1; + float* ptr_prev = ptr_prev_base + iz * dimn1n2 + iy * n1; + float* ptr_vel = ptr_vel_base + iz * dimn1n2 + iy * n1; + + for (auto ix = kHalfLength; ix < n1_end; 
ix++) {
+ float value = ptr_prev[ix] * coeff[0];
+ value += STENCIL_LOOKUP(1);
+ value += STENCIL_LOOKUP(2);
+ value += STENCIL_LOOKUP(3);
+ value += STENCIL_LOOKUP(4);
+ value += STENCIL_LOOKUP(5);
+ value += STENCIL_LOOKUP(6);
+ value += STENCIL_LOOKUP(7);
+ value += STENCIL_LOOKUP(8);
+
+ ptr_next[ix] = 2.0f * ptr_prev[ix] - ptr_next[ix] + value * ptr_vel[ix];
+ }
+ }
+ }
+}
+
+void CalculateReference(float* next, float* prev, float* vel, float* coeff,
+ const size_t n1, const size_t n2, const size_t n3,
+ const size_t nreps) {
+ for (auto it = 0; it < nreps; it += 1) {
+ iso3dfdCPUIteration(next, prev, vel, coeff, n1, n2, n3);
+ std::swap(next, prev);
+ }
+}
+
+void VerifyResult(float* prev, float* next, float* vel, float* coeff,
+ const size_t n1, const size_t n2, const size_t n3,
+ const size_t nreps) {
+ std::cout << "Running CPU version for result comparison: ";
+ auto nsize = n1 * n2 * n3;
+ float* temp = new float[nsize];
+ memcpy(temp, prev, nsize * sizeof(float));
+ initialize(prev, next, vel, n1, n2, n3);
+ CalculateReference(next, prev, vel, coeff, n1, n2, n3, nreps);
+ bool error = WithinEpsilon(temp, prev, n1, n2, n3, kHalfLength, 0, 0.1f);
+ if (error) {
+ std::cout << "Final wavefields from SYCL device and CPU are not "
+ << "equivalent: Fail\n";
+ } else {
+ std::cout << "Final wavefields from SYCL device and CPU are equivalent:"
+ << " Success\n";
+ }
+ delete[] temp;
+}
diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/CMakeLists.txt b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/CMakeLists.txt
new file mode 100644
index 0000000000..e0bded3dae
--- /dev/null
+++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/CMakeLists.txt
@@ -0,0 +1,4 @@
+cmake_minimum_required (VERSION 3.4)
+set(CMAKE_CXX_COMPILER "icpx")
+project (Iso3DFD)
+add_subdirectory (src)
\ No newline at end of file
diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/README.md
b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/README.md
new file mode 100644
index 0000000000..7932525cef
--- /dev/null
+++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/README.md
@@ -0,0 +1,139 @@
+# Guided ISO3DFD GPU Optimization
+The `guided iso3dfd GPUOptimization` sample demonstrates how to use the Intel® oneAPI Base Toolkit (Base Kit) and tools found in the Base Kit to optimize code for GPU offload. ISO3DFD refers to Three-Dimensional Finite-Difference Wave Propagation in Isotropic Media; the sample is a three-dimensional stencil that simulates a wave propagating through a 3D isotropic medium.
+
+This sample follows the workflow found in [Optimize Your GPU Application with the Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/gpu-optimization-workflow.html#gs.101gmt2).
+
+For comprehensive instructions, see the [Intel® oneAPI Programming Guide](https://software.intel.com/en-us/oneapi-programming-guide) and search based on relevant terms noted in the comments.
+
+| Property | Description
+|:--- |:---
+| What you will learn | How to offload the computation to GPU and iteratively optimize the application performance using the Intel® oneAPI DPC++/C++ Compiler
+| Time to complete | 50 minutes
+
+## Purpose
+
+This sample starts with a CPU-oriented application and shows how to use SYCL* and the oneAPI tools to offload regions of the code to the target system GPU. The sample relies heavily on Intel® Advisor, a design and analysis tool for developing performant code. We'll use Intel® Advisor to conduct offload modeling and identify the code regions that will benefit the most from GPU offload. Once the initial offload is complete, we'll walk through how to develop an optimization strategy by iteratively optimizing the code based on opportunities exposed by Intel® Advisor roofline analysis.
+
+ISO3DFD is a finite difference stencil kernel for solving the 3D acoustic isotropic wave equation, which can be used as a proxy for propagating a seismic wave. The kernels are 16th order in space, with symmetric coefficients, and use a 2nd order in time scheme without boundary conditions.
+
+The sample includes four versions of the iso3dfd project:
+
+- `1_CPU_only.cpp`: basic serial CPU implementation.
+- `2_GPU_basic.cpp`: initial GPU offload version using SYCL.
+- `3_GPU_linear.cpp`: first compute optimization, changing the indexing pattern.
+- `4_GPU_optimized.cpp`: additional optimizations for memory-bound kernels.
+
+
+## Prerequisites
+| Optimized for | Description
+|:--- |:---
+| OS | Linux* Ubuntu* 18.04 <br> Windows* 10
+| Hardware | Skylake with GEN9 or newer
+| Software | Intel® oneAPI DPC++/C++ Compiler <br>
Intel® Advisor
+
+
+## Key Implementation Details
+
+The basic SYCL* standards implemented in the code include the use of the following:
+
+- SYCL* local buffers and accessors (declare local memory buffers and accessors to be accessed and managed by each workgroup)
+- Code for Shared Local Memory (SLM) optimizations
+- SYCL* kernels (including parallel_for function and nd-range<3> objects)
+- SYCL* queues (including exception handlers)
+
+
+## Building the `ISO3DFD` Program for CPU and GPU
+
+> **Note**: If you have not already done so, set up your CLI
+> environment by sourcing the `setvars` script located in
+> the root of your oneAPI installation.
+>
+> Linux:
+> - For system wide installations: `. /opt/intel/oneapi/setvars.sh`
+> - For private installations: `. ~/intel/oneapi/setvars.sh`
+>
+> Windows:
+> - `C:\Program Files (x86)\Intel\oneAPI\setvars.bat`
+>
+>For more information on environment variables, see Use the setvars Script for [Linux or macOS](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-development-environment-setup/use-the-setvars-script-with-linux-or-macos.html), or [Windows](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-development-environment-setup/use-the-setvars-script-with-windows.html).
+
+
+> **Note**: For GPU analysis on Linux*, enable collecting GPU hardware metrics by setting the value of the dev.i915.perf_stream_paranoid sysctl option to 0 as follows. This command makes a temporary change that is lost after reboot:
+>
+> `sudo sysctl -w dev.i915.perf_stream_paranoid=0`
+>
+>To make a permanent change, enter:
+>
+> `sudo echo dev.i915.perf_stream_paranoid=0 > /etc/sysctl.d/60-mdapi.conf`
+
+### Running Samples in Intel® DevCloud
+
+If running a sample in the Intel® DevCloud, you must specify the compute node (CPU, GPU, FPGA) and whether to run in batch or interactive mode.
For more information, see the Intel® oneAPI Base Toolkit [Get Started Guide](https://devcloud.intel.com/oneapi/get_started/).
+
+### On Linux*
+Perform the following steps:
+1. Build the program using the following `cmake` commands.
+ ```
+ $ mkdir build
+ $ cd build
+ $ cmake ..
+ $ make
+ ```
+
+2. Run the program.
+ ```
+ $ make run_all
+ ```
+#### Training Modules
+
+| Modules | Description
+|---|---|
+|[__ISO3DFD and Offload Advisor analysis running on CPU__](01_ISO3DFD_CPU/iso3dfd_Offload_Advisor_Analysis.ipynb) | Provide performance analysis/projections of the application, then run offload modeling on the CPU version of the application|
+|[__ISO3DFD implementation using SYCL offloading to a GPU__](02_ISO3DFD_GPU_Basics/iso3dfd_gpu_basice.ipynb) | How to offload the most profitable loops in the code to the GPU using SYCL|
+|[__ISO3DFD on a GPU and index computations__](03_ISO3DFD_GPU_Linear/iso3dfd_gpu_linear.ipynb)| Write kernels with fewer index calculations by changing how we calculate indices; we can change the 3D indexing to 1D|
+|[__ISO3DFD nd_range implementation on a GPU__](iso3dfd_gpu_optimized/3dfd_gpu_basice.ipynb)| Change the kernel to nd_range; we learn to offload more work in each local work group, which optimizes loading neighboring stencil points from the fast L1 cache|
+
+#### Content Structure
+Each module folder has a Jupyter Notebook file (`*.ipynb`), which can be opened in Jupyter Lab to view the training content, edit code and compile/run. Along with the Notebook file, there is a `lab` and a `src` folder with SYCL source code for samples used in the Notebook. The module folder also has `run_*.sh` files, which can be used in a shell terminal to compile and run each sample code.
+
+- `01_{Module_Name}`
+ - `lab`
+ - `{sample_code_name}.cpp` - _(sample code editable via Jupyter Notebook)_
+ - `src`
+ - `{sample_code_name}.cpp` - _(copy of sample code)_
+ - `01_{Module_Name}.ipynb` - _(Jupyter Notebook with training content and sample codes)_
+ - `run_{sample_code_name}.sh` - _(script to compile and run {sample_code_name}.cpp)_
+ - `License.txt`
+ - `Readme.md`
+
+
+## Install Directions
+
+The training content can be accessed locally after installing the necessary tools, or directly on Intel DevCloud without any installation.
+
+#### Access using Intel DevCloud
+
+The Jupyter notebooks are tested and can be run on Intel DevCloud without any installation. Follow these steps to access the notebooks on Intel DevCloud:
+1. Register on [Intel DevCloud](https://devcloud.intel.com/oneapi)
+2. Log in, select Get Started and launch Jupyter Lab
+3. Open a terminal in Jupyter Lab, git clone the repo and access the Notebooks
+
+#### Local Installation of oneAPI Tools and JupyterLab
+
+The Jupyter Notebooks can be downloaded to a local computer and accessed:
+- Install the Intel oneAPI Base Toolkit on the local computer: [Installation Guide](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html)
+- Install Jupyter Lab on the local computer: [Installation Guide](https://jupyterlab.readthedocs.io/en/stable/getting_started/installation.html)
+- git clone the repo and access the Notebooks using Jupyter Lab
+
+#### Local Installation of oneAPI Tools and Use of the Command Line
+
+The Jupyter Notebooks can be viewed on GitHub, and you can run the code on the command line:
+- Install the Intel oneAPI Base Toolkit on the local computer (Linux): [Installation Guide](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html)
+- git clone the repo
+- Open a command line terminal and use the `run_*.sh` script in each module to compile and run the code.
+
+## License
+Code samples are licensed under the MIT license. See [License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt) for details.
+
+Third party program Licenses can be found here: [third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt)
+
diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/Welcome.ipynb b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/Welcome.ipynb
new file mode 100644
index 0000000000..481331e894
--- /dev/null
+++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/Welcome.ipynb
@@ -0,0 +1,82 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "# Guided ISO3DFD GPU Optimization Modules"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "The concepts build on top of each other, introducing and reinforcing the ideas behind ISO3DFD and its optimization using oneAPI and SYCL programming.\n",
+ "\n",
+ "## Module 1 - [ISO3DFD and Offload Advisor analysis running on CPU](01_ISO3DFD_CPU/iso3dfd_Offload_Advisor_Analysis.ipynb)\n",
+ "This module introduces ISO3DFD and the use of the Intel® Advisor analysis tool to provide performance analysis/projections of the application; we then run offload modeling on the CPU version of the application to identify code regions that are good opportunities for GPU offload.\n",
+ "\n",
+ "## Module 2 - [ISO3DFD implementation using SYCL and offloading to a GPU](02_ISO3DFD_GPU_Basic/iso3dfd_gpu_basic.ipynb)\n",
+ "This module implements the basic offload of the iso3dfd function to an available GPU on the system. We have to create a queue and change the iso3dfd function. Instead of iterating over all the cells in memory, we create buffers and accessors to move the data to the GPU when needed and create a kernel that does the calculations; each kernel instance calculates one cell.\n",
+ "\n",
+ "\n",
+ "## Module 3 - [ISO3DFD on a GPU and index computations](03_ISO3DFD_GPU_Linear/iso3dfd_gpu_linear.ipynb)\n",
+ "In this notebook, we address compute-bound kernels by reducing index calculations: we change the 3D indexing to 1D. We learn how to flatten the buffers, change how each kernel computes its location in memory, and change how we access the neighbors.\n",
+ "\n",
+ "## Module 4 - [ISO3DFD nd_range implementation on a GPU](04_ISO3DFD_GPU_Optimized/iso3dfd_gpu_optimized.ipynb)\n",
+ "In this module we change the kernel to nd_range; we learn to offload more work in each local work group, which optimizes loading neighboring stencil points from the fast L1 cache. This latest iteration adds new arguments for the nd_range size and kernel iterations, changes the kernel_range back to 3D, introduces a local_range for the work group size, changes parallel_for to use nd_range, and updates how indices are calculated with the introduction of dedicated cache-reuse memory.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.5"
+ },
+ "toc": {
+ "base_numbering": 1,
+ "nav_menu": {},
+ "number_sections": true,
+ "sideBar": true,
+ "skip_h1_title": false,
+ "title_cell": "Table of Contents",
+ "title_sidebar": "Contents",
"toc_cell": false, + "toc_position": { + "height": "calc(100% - 180px)", + "left": "10px", + "top": "150px", + "width": "384.391px" + }, + "toc_section_display": true, + "toc_window_display": true + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/sample.json b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/sample.json new file mode 100644 index 0000000000..35c790e80e --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/sample.json @@ -0,0 +1,24 @@ +{ + "guid": "7bd9f396-8880-4539-8301-71084626b7db", + "name": "Guided iso3dfd GPU optimization", + "categories": ["Toolkit/oneAPI Direct Programming/C++SYCL/Structured Grids"], + "description": "Step-by-step GPU optimization guide with Intel Advisor and ISO3DFD sample", + "toolchain": [ "dpcpp" ], + "targetDevice": [ "CPU", "GPU" ], + "gpuRequired": ["gen9", "pvc"], + "languages": [ { "cpp": {} } ], + "os": [ "linux" ], + "builder": [ "ide", "cmake" ], + "ciTests": { + "linux": [{ + "steps": [ + "mkdir build", + "cd build", + "cmake ..", + "make", + "make run_all" + ] + }] + }, + "expertise": "Code Optimization" + } diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/src/1_CPU_only.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/src/1_CPU_only.cpp new file mode 100644 index 0000000000..4465d53864 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/src/1_CPU_only.cpp @@ -0,0 +1,129 @@ +//============================================================== +// Copyright 2022 Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= + +#include +#include +#include + +#include "Utils.hpp" + +void inline iso3dfdIteration(float* ptr_next_base, float* ptr_prev_base, + float* ptr_vel_base, float* coeff, const size_t n1, + const size_t n2, const size_t n3) { + auto dimn1n2 = n1 * n2; + + // Remove HALO from the end + 
auto n3_end = n3 - kHalfLength; + auto n2_end = n2 - kHalfLength; + auto n1_end = n1 - kHalfLength; + + for (auto iz = kHalfLength; iz < n3_end; iz++) { + for (auto iy = kHalfLength; iy < n2_end; iy++) { + // Calculate start pointers for the row over X dimension + float* ptr_next = ptr_next_base + iz * dimn1n2 + iy * n1; + float* ptr_prev = ptr_prev_base + iz * dimn1n2 + iy * n1; + float* ptr_vel = ptr_vel_base + iz * dimn1n2 + iy * n1; + + // Iterate over X + for (auto ix = kHalfLength; ix < n1_end; ix++) { + // Calculate values for each cell + float value = ptr_prev[ix] * coeff[0]; + for (int i = 1; i <= kHalfLength; i++) { + value += + coeff[i] * + (ptr_prev[ix + i] + ptr_prev[ix - i] + + ptr_prev[ix + i * n1] + ptr_prev[ix - i * n1] + + ptr_prev[ix + i * dimn1n2] + ptr_prev[ix - i * dimn1n2]); + } + ptr_next[ix] = 2.0f * ptr_prev[ix] - ptr_next[ix] + value * ptr_vel[ix]; + } + } + } +} + +void iso3dfd(float* next, float* prev, float* vel, float* coeff, + const size_t n1, const size_t n2, const size_t n3, + const size_t nreps) { + for (auto it = 0; it < nreps; it++) { + iso3dfdIteration(next, prev, vel, coeff, n1, n2, n3); + // Swap the pointers for always having current values in prev array + std::swap(next, prev); + } +} + +int main(int argc, char* argv[]) { + // Arrays used to update the wavefield + float* prev; + float* next; + // Array to store wave velocity + float* vel; + + // Variables to store size of grids and number of simulation iterations + size_t n1, n2, n3; + size_t num_iterations; + + if (argc < 5) { + Usage(argv[0]); + return 1; + } + + try { + // Parse command line arguments and increase them by HALO + n1 = std::stoi(argv[1]) + (2 * kHalfLength); + n2 = std::stoi(argv[2]) + (2 * kHalfLength); + n3 = std::stoi(argv[3]) + (2 * kHalfLength); + num_iterations = std::stoi(argv[4]); + } catch (...) 
{ + Usage(argv[0]); + return 1; + } + + // Validate input sizes for the grid + if (ValidateInput(n1, n2, n3, num_iterations)) { + Usage(argv[0]); + return 1; + } + + // Compute the total size of grid + size_t nsize = n1 * n2 * n3; + + prev = new float[nsize]; + next = new float[nsize]; + vel = new float[nsize]; + + // Compute coefficients to be used in wavefield update + float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1, + +7.572087e-2, -1.76767677e-2, +3.480962e-3, + -5.180005e-4, +5.074287e-5, -2.42812e-6}; + + // Apply the DX, DY and DZ to coefficients + coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz); + for (auto i = 1; i <= kHalfLength; i++) { + coeff[i] = coeff[i] / (dxyz * dxyz); + } + + // Initialize arrays and introduce initial conditions (source) + initialize(prev, next, vel, n1, n2, n3); + + std::cout << "Running on CPU serial version\n"; + auto start = std::chrono::steady_clock::now(); + + // Invoke the driver function to perform 3D wave propagation 1 thread serial + // version + iso3dfd(next, prev, vel, coeff, n1, n2, n3, num_iterations); + + auto end = std::chrono::steady_clock::now(); + auto time = std::chrono::duration_cast(end - start) + .count(); + + printStats(time, n1, n2, n3, num_iterations); + + delete[] prev; + delete[] next; + delete[] vel; + + return 0; +} diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/src/2_GPU_basic.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/src/2_GPU_basic.cpp new file mode 100644 index 0000000000..3571f98bfc --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/src/2_GPU_basic.cpp @@ -0,0 +1,153 @@ +//============================================================== +// Copyright 2022 Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= + +#include +#include +#include +#include + +#include "Utils.hpp" + +using namespace sycl; + +void iso3dfd(queue& q, float* next, float* prev, 
float* vel, float* coeff, + const size_t n1, const size_t n2, const size_t n3, + const size_t nreps) { + // Create 3D SYCL range for kernels which does not include HALO + range<3> kernel_range(n1 - 2 * kHalfLength, n2 - 2 * kHalfLength, + n3 - 2 * kHalfLength); + // Create 3D SYCL range for buffers which include HALO + range<3> buffer_range(n1, n2, n3); + // Create buffers using SYCL class buffer + buffer next_buf(next, buffer_range); + buffer prev_buf(prev, buffer_range); + buffer vel_buf(vel, buffer_range); + buffer coeff_buf(coeff, range(kHalfLength + 1)); + + for (auto it = 0; it < nreps; it += 1) { + // Submit command group for execution + q.submit([&](handler& h) { + // Create accessors + accessor next_acc(next_buf, h); + accessor prev_acc(prev_buf, h); + accessor vel_acc(vel_buf, h, read_only); + accessor coeff_acc(coeff_buf, h, read_only); + + // Send a SYCL kernel(lambda) to the device for parallel execution + // Each kernel computes a single cell + h.parallel_for(kernel_range, [=](id<3> idx) { + // Start of device code + // Add offsets to indices to exclude HALO + int i = idx[0] + kHalfLength; + int j = idx[1] + kHalfLength; + int k = idx[2] + kHalfLength; + + // Calculate values for each cell + float value = prev_acc[i][j][k] * coeff_acc[0]; +#pragma unroll(8) + for (int x = 1; x <= kHalfLength; x++) { + value += + coeff_acc[x] * (prev_acc[i][j][k + x] + prev_acc[i][j][k - x] + + prev_acc[i][j + x][k] + prev_acc[i][j - x][k] + + prev_acc[i + x][j][k] + prev_acc[i - x][j][k]); + } + next_acc[i][j][k] = 2.0f * prev_acc[i][j][k] - next_acc[i][j][k] + + value * vel_acc[i][j][k]; + // End of device code + }); + }); + + // Swap the buffers for always having current values in prev buffer + std::swap(next_buf, prev_buf); + } +} + +int main(int argc, char* argv[]) { + // Arrays used to update the wavefield + float* prev; + float* next; + // Array to store wave velocity + float* vel; + + // Variables to store size of grids and number of simulation iterations + size_t n1, n2,
n3; + size_t num_iterations; + + // Flag to verify results with CPU version + bool verify = false; + + if (argc < 5) { + Usage(argv[0]); + return 1; + } + + try { + // Parse command line arguments and increase them by HALO + n1 = std::stoi(argv[1]) + (2 * kHalfLength); + n2 = std::stoi(argv[2]) + (2 * kHalfLength); + n3 = std::stoi(argv[3]) + (2 * kHalfLength); + num_iterations = std::stoi(argv[4]); + if (argc > 5) verify = true; + } catch (...) { + Usage(argv[0]); + return 1; + } + + // Validate input sizes for the grid + if (ValidateInput(n1, n2, n3, num_iterations)) { + Usage(argv[0]); + return 1; + } + + // Create queue and print target info with default selector and in order + // property + queue q(default_selector_v, {property::queue::in_order()}); + std::cout << " Running GPU basic offload version\n"; + printTargetInfo(q); + + // Compute the total size of grid + size_t nsize = n1 * n2 * n3; + + prev = new float[nsize]; + next = new float[nsize]; + vel = new float[nsize]; + + // Compute coefficients to be used in wavefield update + float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1, + +7.572087e-2, -1.76767677e-2, +3.480962e-3, + -5.180005e-4, +5.074287e-5, -2.42812e-6}; + + // Apply the DX, DY and DZ to coefficients + coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz); + for (auto i = 1; i <= kHalfLength; i++) { + coeff[i] = coeff[i] / (dxyz * dxyz); + } + + // Initialize arrays and introduce initial conditions (source) + initialize(prev, next, vel, n1, n2, n3); + + auto start = std::chrono::steady_clock::now(); + + // Invoke the driver function to perform 3D wave propagation offloaded to + // the device + iso3dfd(q, next, prev, vel, coeff, n1, n2, n3, num_iterations); + + auto end = std::chrono::steady_clock::now(); + auto time = std::chrono::duration_cast(end - start) + .count(); + printStats(time, n1, n2, n3, num_iterations); + + // Verify result with the CPU serial version + if (verify) { + VerifyResult(prev, next, vel, coeff, n1, n2, n3, 
num_iterations); + } + + delete[] prev; + delete[] next; + delete[] vel; + + return 0; +} \ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/src/3_GPU_linear.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/src/3_GPU_linear.cpp new file mode 100644 index 0000000000..553b38a47d --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/src/3_GPU_linear.cpp @@ -0,0 +1,157 @@ +//============================================================== +// Copyright 2022 Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= + +#include +#include +#include +#include + +#include "Utils.hpp" + +using namespace sycl; + +void iso3dfd(queue& q, float* next, float* prev, float* vel, float* coeff, + const size_t n1, const size_t n2, const size_t n3, + const size_t nreps) { + // Create 3D SYCL range for kernels which does not include HALO + range<3> kernel_range(n1 - 2 * kHalfLength, n2 - 2 * kHalfLength, + n3 - 2 * kHalfLength); + // Create 1D SYCL range for buffers which include HALO + range<1> buffer_range(n1 * n2 * n3); + // Create buffers using SYCL class buffer + buffer next_buf(next, buffer_range); + buffer prev_buf(prev, buffer_range); + buffer vel_buf(vel, buffer_range); + buffer coeff_buf(coeff, range(kHalfLength + 1)); + + for (auto it = 0; it < nreps; it++) { + // Submit command group for execution + q.submit([&](handler& h) { + // Create accessors + accessor next_acc(next_buf, h); + accessor prev_acc(prev_buf, h); + accessor vel_acc(vel_buf, h, read_only); + accessor coeff_acc(coeff_buf, h, read_only); + + // Send a SYCL kernel(lambda) to the device for parallel execution + // Each kernel computes a single cell + h.parallel_for(kernel_range, [=](id<3> nidx) { + // Start of device code + // Add offsets to indices to exclude HALO + int n2n3 = n2 * n3; + int i = nidx[0] + kHalfLength; + int j = nidx[1] + kHalfLength; + int k = nidx[2] + kHalfLength;
+ + // Calculate linear index for each cell + int idx = i * n2n3 + j * n3 + k; + + // Calculate values for each cell + float value = prev_acc[idx] * coeff_acc[0]; +#pragma unroll(8) + for (int x = 1; x <= kHalfLength; x++) { + value += + coeff_acc[x] * (prev_acc[idx + x] + prev_acc[idx - x] + + prev_acc[idx + x * n3] + prev_acc[idx - x * n3] + + prev_acc[idx + x * n2n3] + prev_acc[idx - x * n2n3]); + } + next_acc[idx] = 2.0f * prev_acc[idx] - next_acc[idx] + + value * vel_acc[idx]; + // End of device code + }); + }); + + // Swap the buffers for always having current values in prev buffer + std::swap(next_buf, prev_buf); + } +} + +int main(int argc, char* argv[]) { + // Arrays used to update the wavefield + float* prev; + float* next; + // Array to store wave velocity + float* vel; + + // Variables to store size of grids and number of simulation iterations + size_t n1, n2, n3; + size_t num_iterations; + + // Flag to verify results with CPU version + bool verify = false; + + if (argc < 5) { + Usage(argv[0]); + return 1; + } + + try { + // Parse command line arguments and increase them by HALO + n1 = std::stoi(argv[1]) + (2 * kHalfLength); + n2 = std::stoi(argv[2]) + (2 * kHalfLength); + n3 = std::stoi(argv[3]) + (2 * kHalfLength); + num_iterations = std::stoi(argv[4]); + if (argc > 5) verify = true; + } catch (...) 
{ + Usage(argv[0]); + return 1; + } + + // Validate input sizes for the grid + if (ValidateInput(n1, n2, n3, num_iterations)) { + Usage(argv[0]); + return 1; + } + + // Create queue and print target info with default selector and in order + // property + queue q(default_selector_v, {property::queue::in_order()}); + std::cout << " Running linear indexed GPU version\n"; + printTargetInfo(q); + + // Compute the total size of grid + size_t nsize = n1 * n2 * n3; + + prev = new float[nsize]; + next = new float[nsize]; + vel = new float[nsize]; + + // Compute coefficients to be used in wavefield update + float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1, + +7.572087e-2, -1.76767677e-2, +3.480962e-3, + -5.180005e-4, +5.074287e-5, -2.42812e-6}; + + // Apply the DX, DY and DZ to coefficients + coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz); + for (auto i = 1; i <= kHalfLength; i++) { + coeff[i] = coeff[i] / (dxyz * dxyz); + } + + // Initialize arrays and introduce initial conditions (source) + initialize(prev, next, vel, n1, n2, n3); + + auto start = std::chrono::steady_clock::now(); + + // Invoke the driver function to perform 3D wave propagation offloaded to + // the device + iso3dfd(q, next, prev, vel, coeff, n1, n2, n3, num_iterations); + + auto end = std::chrono::steady_clock::now(); + auto time = std::chrono::duration_cast(end - start) + .count(); + printStats(time, n1, n2, n3, num_iterations); + + // Verify result with the CPU serial version + if (verify) { + VerifyResult(prev, next, vel, coeff, n1, n2, n3, num_iterations); + } + + delete[] prev; + delete[] next; + delete[] vel; + + return 0; +} \ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/src/4_GPU_optimized.cpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/src/4_GPU_optimized.cpp new file mode 100644 index 0000000000..99dd9d85b8 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/src/4_GPU_optimized.cpp @@ -0,0 
+1,171 @@ +//============================================================== +// Copyright © Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= +#include +#include +#include +#include + +#include "Utils.hpp" + +using namespace sycl; + +void iso3dfd(queue& q, float* ptr_next, float* ptr_prev, float* ptr_vel, float* ptr_coeff, + const size_t n1, const size_t n2, const size_t n3,size_t n1_block, size_t n2_block, size_t n3_block, + const size_t nIterations) { + auto nx = n1; + auto nxy = n1*n2; + auto grid_size = nxy*n3; + + auto b1 = kHalfLength; + auto b2 = kHalfLength; + auto b3 = kHalfLength; + + auto next = sycl::aligned_alloc_device(64, grid_size + 16, q); + next += (16 - b1); + q.memcpy(next, ptr_next, sizeof(float)*grid_size); + auto prev = sycl::aligned_alloc_device(64, grid_size + 16, q); + prev += (16 - b1); + q.memcpy(prev, ptr_prev, sizeof(float)*grid_size); + auto vel = sycl::aligned_alloc_device(64, grid_size + 16, q); + vel += (16 - b1); + q.memcpy(vel, ptr_vel, sizeof(float)*grid_size); + //auto coeff = sycl::aligned_alloc_device(64, grid_size + 16, q); + auto coeff = sycl::aligned_alloc_device(64, kHalfLength+1 , q); + q.memcpy(coeff, ptr_coeff, sizeof(float)*(kHalfLength+1)); + //coeff += (16 - b1); + //q.memcpy(coeff, coeff, sizeof(float)*grid_size); + q.wait(); + + //auto local_nd_range = range(1, n2_block, n1_block); + //auto global_nd_range = range((n3 - 2 * kHalfLength)/n3_block, (n2 - 2 * kHalfLength)/n2_block, + //(n1 - 2 * kHalfLength)); + + auto local_nd_range = range<3>(n3_block,n2_block,n1_block); + auto global_nd_range = range<3>((n3-2*b3+n3_block-1)/n3_block*n3_block,(n2-2*b2+n2_block-1)/n2_block*n2_block,n1_block); + + + for (auto i = 0; i < nIterations; i += 1) { + q.submit([&](auto &h) { + h.parallel_for( + nd_range(global_nd_range, local_nd_range), [=](auto item) + //[[intel::reqd_sub_group_size(32)]] + //[[intel::kernel_args_restrict]] + { + const int iz = b3 + 
item.get_global_id(0); + const int iy = b2 + item.get_global_id(1); + if (iz < n3 - b3 && iy < n2 - b2) + for (int ix = b1+item.get_global_id(2); ix < n1 - b1; ix += n1_block) + { + auto gid = ix + iy*nx + iz*nxy; + float *pgid = prev+gid; + auto value = coeff[0] * pgid[0]; +#pragma unroll(kHalfLength) + for (auto iter = 1; iter <= kHalfLength; iter++) + value += coeff[iter]*(pgid[iter*nxy] + pgid[-iter*nxy] + pgid[iter*nx] + pgid[-iter*nx] + pgid[iter] + pgid[-iter]); + next[gid] = 2.0f*pgid[0] - next[gid] + value*vel[gid]; + } + }); + }).wait(); + std::swap(next, prev); + } + q.memcpy(ptr_prev, prev, sizeof(float)*grid_size); + + sycl::free(next - (16 - b1),q); + sycl::free(prev - (16 - b1),q); + sycl::free(vel - (16 - b1),q); + sycl::free(coeff,q); + +} + +int main(int argc, char* argv[]) { + // Arrays used to update the wavefield + float* prev; + float* next; + // Array to store wave velocity + float* vel; + + // Variables to store size of grids and number of simulation iterations + size_t n1, n2, n3; + size_t n1_block, n2_block, n3_block; + size_t num_iterations; + + // Flag to verify results with CPU version + bool verify = false; + + if (argc < 5) { + Usage(argv[0]); + return 1; + } + + try { + // Parse command line arguments and increase them by HALO + n1 = std::stoi(argv[1]) + (2 * kHalfLength); + n2 = std::stoi(argv[2]) + (2 * kHalfLength); + n3 = std::stoi(argv[3]) + (2 * kHalfLength); + n1_block = std::stoi(argv[4]); + n2_block = std::stoi(argv[5]); + n3_block = std::stoi(argv[6]); + num_iterations = std::stoi(argv[7]); + } catch (...) 
{ + Usage(argv[0]); + return 1; + } + + // Validate input sizes for the grid + if (ValidateInput(n1, n2, n3, num_iterations)) { + Usage(argv[0]); + return 1; + } + + // Create queue and print target info with default selector and in order + // property + queue q(default_selector_v, {property::queue::in_order()}); + std::cout << " Running linear indexed GPU version\n"; + printTargetInfo(q); + + // Compute the total size of grid + size_t nsize = n1 * n2 * n3; + + prev = new float[nsize]; + next = new float[nsize]; + vel = new float[nsize]; + + // Compute coefficients to be used in wavefield update + float coeff[kHalfLength + 1] = {-3.0548446, +1.7777778, -3.1111111e-1, + +7.572087e-2, -1.76767677e-2, +3.480962e-3, + -5.180005e-4, +5.074287e-5, -2.42812e-6}; + + // Apply the DX, DY and DZ to coefficients + coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz); + for (auto i = 1; i <= kHalfLength; i++) { + coeff[i] = coeff[i] / (dxyz * dxyz); + } + + // Initialize arrays and introduce initial conditions (source) + initialize(prev, next, vel, n1, n2, n3); + + auto start = std::chrono::steady_clock::now(); + + // Invoke the driver function to perform 3D wave propagation offloaded to + // the device + iso3dfd(q, next, prev, vel, coeff, n1, n2, n3,n1_block,n2_block,n3_block, num_iterations); + + auto end = std::chrono::steady_clock::now(); + auto time = std::chrono::duration_cast(end - start) + .count(); + printStats(time, n1, n2, n3, num_iterations); + + // Verify result with the CPU serial version + if (verify) { + VerifyResult(prev, next, vel, coeff, n1, n2, n3, num_iterations); + } + + delete[] prev; + delete[] next; + delete[] vel; + + return 0; +} diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/src/CMakeLists.txt b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/src/CMakeLists.txt new file mode 100644 index 0000000000..93f5af83b7 --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/src/CMakeLists.txt @@ -0,0 +1,29 @@ 
+set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -fsycl --std=c++17") +# Set default build type to RelWithDebInfo if not specified +if (NOT CMAKE_BUILD_TYPE) + message (STATUS "Default CMAKE_BUILD_TYPE not set using Release with Debug Info") + set (CMAKE_BUILD_TYPE "RelWithDebInfo" CACHE + STRING "Choose the type of build, options are: None Debug Release RelWithDebInfo MinSizeRel" + FORCE) +endif() + +set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS}") + +add_executable(1_CPU_only 1_CPU_only.cpp) +add_executable(2_GPU_basic 2_GPU_basic.cpp) +add_executable(3_GPU_linear 3_GPU_linear.cpp) +add_executable(4_GPU_optimized 4_GPU_optimized.cpp) + +target_link_libraries(1_CPU_only OpenCL sycl) +target_link_libraries(2_GPU_basic OpenCL sycl) +target_link_libraries(3_GPU_linear OpenCL sycl) +target_link_libraries(4_GPU_optimized OpenCL sycl) + +add_custom_target(run_all 1_CPU_only 256 256 256 20 + COMMAND 2_GPU_basic 1024 1024 1024 100 + COMMAND 3_GPU_linear 1024 1024 1024 100 + COMMAND 4_GPU_optimized 1024 1024 1024 32 4 8 100) +add_custom_target(run_cpu 1_CPU_only 1024 1024 1024 100) +add_custom_target(run_gpu_basic 2_GPU_basic 1024 1024 1024 100) +add_custom_target(run_gpu_linear 3_GPU_linear 1024 1024 1024 100) +add_custom_target(run_gpu_optimized 4_GPU_optimized 1024 1024 1024 32 4 8 100) \ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/src/Iso3dfd.hpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/src/Iso3dfd.hpp new file mode 100644 index 0000000000..e3487fa0cf --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/src/Iso3dfd.hpp @@ -0,0 +1,21 @@ +//============================================================== +// Copyright © 2022 Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= + +#pragma once + +constexpr size_t kHalfLength = 8; +constexpr float dxyz = 50.0f; +constexpr float dt = 0.002f; + +#define
STENCIL_LOOKUP(ir) \ + (coeff[ir] * ((ptr_prev[ix + ir] + ptr_prev[ix - ir]) + \ + (ptr_prev[ix + ir * n1] + ptr_prev[ix - ir * n1]) + \ + (ptr_prev[ix + ir * dimn1n2] + ptr_prev[ix - ir * dimn1n2]))) + + +#define KERNEL_STENCIL_LOOKUP(x) \ + coeff[x] * (tab[l_idx + x] + tab[l_idx - x] + front[x] + back[x - 1] \ + + tab[l_idx + l_n3 * x] + tab[l_idx - l_n3 * x]) \ No newline at end of file diff --git a/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/src/Utils.hpp b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/src/Utils.hpp new file mode 100644 index 0000000000..98d4a6e12c --- /dev/null +++ b/DirectProgramming/C++SYCL/Jupyter/C++-to-sycl-migration/src/Utils.hpp @@ -0,0 +1,259 @@ +//============================================================== +// Copyright © 2022 Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= + +#pragma once + +#include +#include + +#include "Iso3dfd.hpp" + +void Usage(const std::string& programName, bool usedNd_ranges = false) { + std::cout << "--------------------------------------\n"; + std::cout << " Incorrect parameters \n"; + std::cout << " Usage: "; + std::cout << programName << " n1 n2 n3 Iterations"; + + if (usedNd_ranges) std::cout << " kernel_iterations n2_WGS n3_WGS"; + + std::cout << " [verify]\n\n"; + std::cout << " n1 n2 n3 : Grid sizes for the stencil\n"; + std::cout << " Iterations : No. of timesteps.\n"; + + if (usedNd_ranges) { + std::cout + << " kernel_iterations : No. 
of cells calculated by one kernel\n"; + std::cout << " n2_WGS n3_WGS : n2 and n3 work group sizes\n"; + } + std::cout + << " [verify] : Optional: Compare results with CPU version\n"; + std::cout << "--------------------------------------\n"; + std::cout << "--------------------------------------\n"; +} + +bool ValidateInput(size_t n1, size_t n2, size_t n3, size_t num_iterations, + size_t kernel_iterations = -1, size_t n2_WGS = kHalfLength, + size_t n3_WGS = kHalfLength) { + if ((n1 < kHalfLength) || (n2 < kHalfLength) || (n3 < kHalfLength) || + (n2_WGS < kHalfLength) || (n3_WGS < kHalfLength)) { + std::cout << "--------------------------------------\n"; + std::cout << " Invalid grid size : n1, n2, n3, n2_WGS, n3_WGS should be " + "greater than " + << kHalfLength << "\n"; + return true; + } + + if ((n2 < n2_WGS) || (n3 < n3_WGS)) { + std::cout << "--------------------------------------\n"; + std::cout << " Invalid work group size : n2 should be greater than n2_WGS " + "and n3 greater than n3_WGS\n"; + return true; + } + + if (((n2 - 2 * kHalfLength) % n2_WGS) && kernel_iterations != -1) { + std::cout << "--------------------------------------\n"; + std::cout << " ERROR: Invalid Grid Size: n2 should be multiple of n2_WGS - " + << n2_WGS << "\n"; + return true; + } + if (((n3 - 2 * kHalfLength) % n3_WGS) && kernel_iterations != -1) { + std::cout << "--------------------------------------\n"; + std::cout << " ERROR: Invalid Grid Size: n3 should be multiple of n3_WGS - " + << n3_WGS << "\n"; + return true; + } + if (((n1 - 2 * kHalfLength) % kernel_iterations) && kernel_iterations != -1) { + std::cout << "--------------------------------------\n"; + std::cout << " ERROR: Invalid Grid Size: n1 should be multiple of " + "kernel_iterations - " + << kernel_iterations << "\n"; + return true; + } + + return false; +} + +bool CheckWorkGroupSize(sycl::queue& q, unsigned int n2_WGS, + unsigned int n3_WGS) { + auto device = q.get_device(); + auto max_block_size = + 
device.get_info(); + + if ((max_block_size > 1) && (n2_WGS * n3_WGS > max_block_size)) { + std::cout << "ERROR: Invalid block sizes: n2_WGS * n3_WGS should be " + "less than or equal to " + << max_block_size << "\n"; + return true; + } + + return false; +} + +void printTargetInfo(sycl::queue& q) { + auto device = q.get_device(); + auto max_block_size = + device.get_info(); + + auto max_exec_unit_count = + device.get_info(); + + std::cout << " Running on " << device.get_info() + << "\n"; + std::cout << " The Device Max Work Group Size is : " << max_block_size + << "\n"; + std::cout << " The Device Max EUCount is : " << max_exec_unit_count << "\n"; +} + +void initialize(float* ptr_prev, float* ptr_next, float* ptr_vel, size_t n1, + size_t n2, size_t n3) { + auto dim2 = n2 * n1; + + for (auto i = 0; i < n3; i++) { + for (auto j = 0; j < n2; j++) { + auto offset = i * dim2 + j * n1; + + for (auto k = 0; k < n1; k++) { + ptr_prev[offset + k] = 0.0f; + ptr_next[offset + k] = 0.0f; + ptr_vel[offset + k] = + 2250000.0f * dt * dt; // Integration of the v*v and dt*dt here + } + } + } + // Then we add a source + float val = 1.f; + for (auto s = 5; s >= 0; s--) { + for (auto i = n3 / 2 - s; i < n3 / 2 + s; i++) { + for (auto j = n2 / 4 - s; j < n2 / 4 + s; j++) { + auto offset = i * dim2 + j * n1; + for (auto k = n1 / 4 - s; k < n1 / 4 + s; k++) { + ptr_prev[offset + k] = val; + } + } + } + val *= 10; + } +} + +void printStats(double time, size_t n1, size_t n2, size_t n3, + size_t num_iterations) { + float throughput_mpoints = 0.0f, mflops = 0.0f, normalized_time = 0.0f; + double mbytes = 0.0f; + + normalized_time = (double)time / num_iterations; + throughput_mpoints = ((n1 - 2 * kHalfLength) * (n2 - 2 * kHalfLength) * + (n3 - 2 * kHalfLength)) / + (normalized_time * 1e3f); + mflops = (7.0f * kHalfLength + 5.0f) * throughput_mpoints; + mbytes = 12.0f * throughput_mpoints; + + std::cout << "--------------------------------------\n"; + std::cout << "time : " << time / 1e3f << " 
secs\n"; + std::cout << "throughput : " << throughput_mpoints << " Mpts/s\n"; + std::cout << "flops : " << mflops / 1e3f << " GFlops\n"; + std::cout << "bytes : " << mbytes / 1e3f << " GBytes/s\n"; + std::cout << "\n--------------------------------------\n"; + std::cout << "\n--------------------------------------\n"; +} + +bool WithinEpsilon(float* output, float* reference, const size_t dim_x, + const size_t dim_y, const size_t dim_z, + const unsigned int radius, const int zadjust = 0, + const float delta = 0.01f) { + std::ofstream error_file; + error_file.open("error_diff.txt"); + + bool error = false; + double norm2 = 0; + + for (size_t iz = 0; iz < dim_z; iz++) { + for (size_t iy = 0; iy < dim_y; iy++) { + for (size_t ix = 0; ix < dim_x; ix++) { + if (ix >= radius && ix < (dim_x - radius) && iy >= radius && + iy < (dim_y - radius) && iz >= radius && + iz < (dim_z - radius + zadjust)) { + float difference = fabsf(*reference - *output); + norm2 += difference * difference; + if (difference > delta) { + error = true; + error_file << " ERROR: " << ix << ", " << iy << ", " << iz << " " + << *output << " instead of " << *reference + << " (|e|=" << difference << ")\n"; + } + } + ++output; + ++reference; + } + } + } + + error_file.close(); + norm2 = sqrt(norm2); + if (error) std::cout << "error (Euclidean norm): " << norm2 << "\n"; + return error; +} + +void inline iso3dfdCPUIteration(float* ptr_next_base, float* ptr_prev_base, + float* ptr_vel_base, float* coeff, + const size_t n1, const size_t n2, + const size_t n3) { + auto dimn1n2 = n1 * n2; + + auto n3_end = n3 - kHalfLength; + auto n2_end = n2 - kHalfLength; + auto n1_end = n1 - kHalfLength; + + for (auto iz = kHalfLength; iz < n3_end; iz++) { + for (auto iy = kHalfLength; iy < n2_end; iy++) { + float* ptr_next = ptr_next_base + iz * dimn1n2 + iy * n1; + float* ptr_prev = ptr_prev_base + iz * dimn1n2 + iy * n1; + float* ptr_vel = ptr_vel_base + iz * dimn1n2 + iy * n1; + + for (auto ix = kHalfLength; ix < n1_end; 
ix++) { + float value = ptr_prev[ix] * coeff[0]; + value += STENCIL_LOOKUP(1); + value += STENCIL_LOOKUP(2); + value += STENCIL_LOOKUP(3); + value += STENCIL_LOOKUP(4); + value += STENCIL_LOOKUP(5); + value += STENCIL_LOOKUP(6); + value += STENCIL_LOOKUP(7); + value += STENCIL_LOOKUP(8); + + ptr_next[ix] = 2.0f * ptr_prev[ix] - ptr_next[ix] + value * ptr_vel[ix]; + } + } + } +} + +void CalculateReference(float* next, float* prev, float* vel, float* coeff, + const size_t n1, const size_t n2, const size_t n3, + const size_t nreps) { + for (auto it = 0; it < nreps; it += 1) { + iso3dfdCPUIteration(next, prev, vel, coeff, n1, n2, n3); + std::swap(next, prev); + } +} + +void VerifyResult(float* prev, float* next, float* vel, float* coeff, + const size_t n1, const size_t n2, const size_t n3, + const size_t nreps) { + std::cout << "Running CPU version for result comparison: "; + auto nsize = n1 * n2 * n3; + float* temp = new float[nsize]; + memcpy(temp, prev, nsize * sizeof(float)); + initialize(prev, next, vel, n1, n2, n3); + CalculateReference(next, prev, vel, coeff, n1, n2, n3, nreps); + bool error = WithinEpsilon(temp, prev, n1, n2, n3, kHalfLength, 0, 0.1f); + if (error) { + std::cout << "Final wavefields from SYCL device and CPU are not " + << "equivalent: Fail\n"; + } else { + std::cout << "Final wavefields from SYCL device and CPU are equivalent:" + << " Success\n"; + } + delete[] temp; +} diff --git a/DirectProgramming/C++SYCL/StructuredGrids/iso3dfd_dpcpp/sample.json b/DirectProgramming/C++SYCL/StructuredGrids/iso3dfd_dpcpp/sample.json index d4fab6b745..1a4dc6dcbb 100755 --- a/DirectProgramming/C++SYCL/StructuredGrids/iso3dfd_dpcpp/sample.json +++ b/DirectProgramming/C++SYCL/StructuredGrids/iso3dfd_dpcpp/sample.json @@ -8,7 +8,6 @@ "languages": [ { "cpp": {} } ], "os": [ "linux", "windows" ], "builder": [ "ide", "cmake" ], - "targetDevice": [ "CPU" ], "ciTests": { "linux": [{ "steps": [ diff --git
a/DirectProgramming/C++SYCL/VisualizedSamples/GameOfLife/sample.json b/DirectProgramming/C++SYCL/VisualizedSamples/GameOfLife/sample.json index 552f631f4a..89c58170e5 100644 --- a/DirectProgramming/C++SYCL/VisualizedSamples/GameOfLife/sample.json +++ b/DirectProgramming/C++SYCL/VisualizedSamples/GameOfLife/sample.json @@ -11,11 +11,11 @@ "ciTests": { "linux": [{ "steps": [ + "apt-get install -y libsdl2-dev", "mkdir build", - "cd build", - "../get_sdl2.sh", - "SDL2_DIR=SDL/install/ cmake ..", - "make" + "cd build", + "cmake ..", + "make" ] }], "windows": [{ diff --git a/DirectProgramming/C++SYCL/VisualizedSamples/VisualMandlebrot/Mandel.hpp b/DirectProgramming/C++SYCL/VisualizedSamples/VisualMandlebrot/Mandel.hpp index 6e088dd606..ff21cb9cb9 100644 --- a/DirectProgramming/C++SYCL/VisualizedSamples/VisualMandlebrot/Mandel.hpp +++ b/DirectProgramming/C++SYCL/VisualizedSamples/VisualMandlebrot/Mandel.hpp @@ -137,7 +137,7 @@ void Mandelbrot::Calculate(uint32_t* pixels) { if (singlePrecision) CalculateSP(pixels); else - Calculate(pixels); + CalculateDP(pixels); } void Mandelbrot::CalculateSP(uint32_t* pixels) { diff --git a/DirectProgramming/C++SYCL/VisualizedSamples/VisualMandlebrot/sample.json b/DirectProgramming/C++SYCL/VisualizedSamples/VisualMandlebrot/sample.json index 38eb1aed8b..3fe9ee3f6a 100644 --- a/DirectProgramming/C++SYCL/VisualizedSamples/VisualMandlebrot/sample.json +++ b/DirectProgramming/C++SYCL/VisualizedSamples/VisualMandlebrot/sample.json @@ -11,11 +11,11 @@ "ciTests": { "linux": [{ "steps": [ + "apt-get install -y libsdl2-dev", "mkdir build", - "cd build", - "../get_sdl2.sh", - "SDL2_DIR=SDL/install/ cmake ..", - "make" + "cd build", + "cmake ..", + "make" ] }], "windows": [{ diff --git a/DirectProgramming/C++SYCL/guided_simpleAtomicIntrinsics_SYCLMigration/02_sycl_migrated/Samples/0_Introduction/simpleAtomicIntrinsics/simpleAtomicIntrinsics_cpu.cpp 
b/DirectProgramming/C++SYCL/guided_simpleAtomicIntrinsics_SYCLMigration/02_sycl_migrated/Samples/0_Introduction/simpleAtomicIntrinsics/simpleAtomicIntrinsics_cpu.cpp index 29e36ff61b..3bcc576158 100644 --- a/DirectProgramming/C++SYCL/guided_simpleAtomicIntrinsics_SYCLMigration/02_sycl_migrated/Samples/0_Introduction/simpleAtomicIntrinsics/simpleAtomicIntrinsics_cpu.cpp +++ b/DirectProgramming/C++SYCL/guided_simpleAtomicIntrinsics_SYCLMigration/02_sycl_migrated/Samples/0_Introduction/simpleAtomicIntrinsics/simpleAtomicIntrinsics_cpu.cpp @@ -103,7 +103,7 @@ int computeGold(int *gpuData, const int len) { return false; } - int limit = 17; +/* int limit = 17; val = 0; for (int i = 0; i < len; ++i) { @@ -125,7 +125,7 @@ int computeGold(int *gpuData, const int len) { if (val != gpuData[6]) { printf("atomicDec failed\n"); return false; - } + }*/ found = false; diff --git a/DirectProgramming/C++SYCL/guided_simpleAtomicIntrinsics_SYCLMigration/02_sycl_migrated/Samples/0_Introduction/simpleAtomicIntrinsics/simpleAtomicIntrinsics_kernel.dp.hpp b/DirectProgramming/C++SYCL/guided_simpleAtomicIntrinsics_SYCLMigration/02_sycl_migrated/Samples/0_Introduction/simpleAtomicIntrinsics/simpleAtomicIntrinsics_kernel.dp.hpp index 69951987ed..b30728954e 100644 --- a/DirectProgramming/C++SYCL/guided_simpleAtomicIntrinsics_SYCLMigration/02_sycl_migrated/Samples/0_Introduction/simpleAtomicIntrinsics/simpleAtomicIntrinsics_kernel.dp.hpp +++ b/DirectProgramming/C++SYCL/guided_simpleAtomicIntrinsics_SYCLMigration/02_sycl_migrated/Samples/0_Introduction/simpleAtomicIntrinsics/simpleAtomicIntrinsics_kernel.dp.hpp @@ -65,14 +65,14 @@ void testKernel(int *g_odata, const sycl::nd_item<3> &item_ct1) { // Atomic minimum dpct::atomic_fetch_min( &g_odata[4], tid); - +//Note: The DPC++ compiler is currently in the process of incorporating native support for atomic increment/decrement operations, along with ongoing performance enhancements. 
// Atomic increment (modulo 17+1) - dpct::atomic_fetch_compare_inc( - (unsigned int *)&g_odata[5], 17); + //dpct::atomic_fetch_compare_inc( + // (unsigned int *)&g_odata[5], 17); // Atomic decrement - dpct::atomic_fetch_compare_dec( - (unsigned int *)&g_odata[6], 137); + //dpct::atomic_fetch_compare_dec( + // (unsigned int *)&g_odata[6], 137); // Atomic compare-and-swap dpct::atomic_compare_exchange_strong< diff --git a/DirectProgramming/C++SYCL/guided_simpleAtomicIntrinsics_SYCLMigration/README.md b/DirectProgramming/C++SYCL/guided_simpleAtomicIntrinsics_SYCLMigration/README.md index d24f6452ff..8fd54d575b 100644 --- a/DirectProgramming/C++SYCL/guided_simpleAtomicIntrinsics_SYCLMigration/README.md +++ b/DirectProgramming/C++SYCL/guided_simpleAtomicIntrinsics_SYCLMigration/README.md @@ -41,8 +41,9 @@ This sample demonstrates the migration of the following prominent CUDA features: - Atomic Intrinsics -The kernel `testKernel` demonstrates SYCL arithmetic atomic functions in device code such as `atomic_fetch_add`, `atomic_fetch_sub`, `atomic_exchange`, `atomic_fetch_max`, `atomic_fetch_min`, `atomic_fetch_compare_inc`, `atomic_fetch_compare_dec`, `atomic_compare_exchange_strong`, `atomic_fetch_and`, `atomic_fetch_or`, and `atomic_fetch_xor` migrated from CUDA atomic instructions. +The kernel `testKernel` demonstrates SYCL arithmetic atomic functions in device code such as `atomic_fetch_add`, `atomic_fetch_sub`, `atomic_exchange`, `atomic_fetch_max`, `atomic_fetch_min`, `atomic_compare_exchange_strong`, `atomic_fetch_and`, `atomic_fetch_or`, and `atomic_fetch_xor` migrated from CUDA atomic instructions. +>**Note**: The DPC++ compiler is currently in the process of incorporating native support for atomic increment/decrement operations, along with ongoing performance enhancements. 
>**Note**: Refer to [Workflow for a CUDA* to SYCL* Migration](https://www.intel.com/content/www/us/en/developer/tools/oneapi/training/cuda-sycl-migration-workflow.html) for general information about the migration workflow. ## CUDA source code evaluation diff --git a/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/loop_carried_dependency/README.md b/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/loop_carried_dependency/README.md index 81ed17e6e2..0895471df3 100755 --- a/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/loop_carried_dependency/README.md +++ b/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/loop_carried_dependency/README.md @@ -269,25 +269,27 @@ Look at the _Compiler Report > Throughput Analysis > Loop Analysis_ section in t >**Note**: In the sample, applying the optimization yields a total execution time reduction by almost a factor of 4. The Initiation Interval (II) for the inner loop is 12 because a double floating point add takes 11 cycles on the FPGA. ``` -Number of elements: 16000 +Number of elements: 150 +Running on device: de10_agilex : Agilex Reference Platform (aclde10_agilex0) Run: Unoptimized: -kernel time : 10685.3 ms +kernel time : 0.441344 ms Run: Optimized: -kernel time : 2736.47 ms +kernel time : 0.368128 ms PASSED ``` ### Example Output on FPGA Emulation ``` -Number of elements: 16000 +Number of elements: 150 -Emulator output does not demonstrate true hardware performance. The design may need to run on actual hardware to observe the performance benefit of the optimization exemplified in this tutorial. +Emulator and simulator outputs do not demonstrate true hardware performance. The design may need to run on actual hardware to observe the performance benefit of the optimization exemplified in this tutorial. 
+Running on device: Intel(R) FPGA Emulation Device Run: Unoptimized: -kernel time : 334.33 ms +kernel time : 0.142848 ms Run: Optimized: -kernel time : 335.345 ms +kernel time : 0.12928 ms PASSED ``` diff --git a/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/loop_carried_dependency/src/loop_carried_dependency.cpp b/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/loop_carried_dependency/src/loop_carried_dependency.cpp index a4334aa078..9e1cfa3013 100644 --- a/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/loop_carried_dependency/src/loop_carried_dependency.cpp +++ b/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/loop_carried_dependency/src/loop_carried_dependency.cpp @@ -17,11 +17,8 @@ using namespace std; class UnOptKernel; class OptKernel; -#if defined(FPGA_SIMULATOR) -constexpr size_t kMaxN = 200; -#else -constexpr size_t kMaxN = 500; -#endif +constexpr size_t kMaxN = 150; + event Unoptimized(queue &q, const vector &vec_a, const vector &vec_b, double &result, size_t N) { diff --git a/DirectProgramming/C++SYCL_FPGA/Tutorials/Tools/use_library/README.md b/DirectProgramming/C++SYCL_FPGA/Tutorials/Tools/use_library/README.md index 8feac96641..1c05acebeb 100755 --- a/DirectProgramming/C++SYCL_FPGA/Tutorials/Tools/use_library/README.md +++ b/DirectProgramming/C++SYCL_FPGA/Tutorials/Tools/use_library/README.md @@ -29,6 +29,21 @@ This FPGA tutorial demonstrates how to build SYCL device libraries from RTL sour > > :warning: Make sure you add the device files associated with the FPGA that you are targeting to your Intel® Quartus® Prime installation. +> :warning: This tutorial currently exposes a bug in the Windows* version of the Intel DPC++/C++ oneAPI compiler. 
Make sure you install the Intel DPC++/C++ oneAPI compiler in a path that does not include spaces (for example, `C:\oneAPI\`), or you may see an error message like this when you compile this tutorial: +> ``` +> [100%] Linking CXX executable use_library.report.exe +> Intel(R) oneAPI DPC++/C++ Compiler for applications running on Intel(R) 64, Version 2024.1.0 Build 20240308 +> Copyright (C) 1985-2024 Intel Corporation. All rights reserved. +> +> 'C:/Program' is not recognized as an internal or external command, +> operable program or batch file. +> Couldn't find section with name '.acl.target'. +> Error: Can't get value into file: 'pkg_editor c:/Users/whitepau/AppData/Local/Temp/use_library-6d1544-4f742f.32024.temp_value.txt get .acl.target c:/Users/whitepau/AppData/Local/Temp/use_library-6d1544-4f742f.32024.temp_value.txt' failed + +> llvm-foreach: +> icx-cl: error: fpga compiler command failed with exit code 1 (use -v to see invocation) +> ``` + This sample is part of the FPGA code samples. It is categorized as a Tier 3 sample that demonstrates the usage of a tool. 
diff --git a/Libraries/oneDNN/tutorials/profiling/profile_utils.py b/Libraries/oneDNN/tutorials/profiling/profile_utils.py index 37fc01e3b6..1dc0de0322 100755 --- a/Libraries/oneDNN/tutorials/profiling/profile_utils.py +++ b/Libraries/oneDNN/tutorials/profiling/profile_utils.py @@ -83,29 +83,26 @@ def __init_(self): def load_log(self, log): self.filename = log - self.with_timestamp = True - data = self.load_log_dnnl_timestamp_backend(log) - count = data['time'].count() - - if count <= 1: - data = self.load_log_dnnl_timestamp(log) - count = data['time'].count() - self.with_timestamp = True - if count <= 1: - data = self.load_log_dnnl_backend(log) - count = data['time'].count() + fn_t_list = [self.load_log_dnnl_timestamp_backend, self.load_log_dnnl_timestamp] + fn_not_list = [self.load_log_dnnl_backend, self.load_log_dnnl, self.load_log_mkldnn] + + fn_list = fn_not_list + self.with_timestamp = False + + data = fn_t_list[0](log) + for d in data['timestamp']: + if self.is_float(d) is True: self.with_timestamp = False + fn_list = fn_t_list - if count <= 1: - data = self.load_log_dnnl(log) - count = data['time'].count() - self.with_timestamp = False - - if count == 0: - data = self.load_log_mkldnn(log) + + for index, fn in enumerate(fn_list): + data = fn(log) count = data['time'].count() - self.with_timestamp = False + if count > 2: + print(index) + break exec_data = data[data['exec'] == 'exec'] self.data = data @@ -150,7 +147,7 @@ def load_log_dnnl_backend(self, log): def load_log_dnnl_timestamp_backend(self, log): import pandas as pd - # dnnl_verbose,629411020589.218018,primitive,exec,cpu,convolution,jit:avx2,forward_inference,src_f32::blocked:abcd:f0 wei_f32::blocked:Acdb8a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:aBcd8b:f0,,alg:convolution_direct,mb1_ic3oc96_ih227oh55kh11sh4dh0ph0_iw227ow55kw11sw4dw0pw0,1.21704 + #dnnl_verbose,629411020589.218018,primitive,exec,cpu,convolution,jit:avx2,forward_inference,src_f32::blocked:abcd:f0 wei_f32::blocked:Acdb8a:f0 
bia_f32::blocked:a:f0 dst_f32::blocked:aBcd8b:f0,,alg:convolution_direct,mb1_ic3oc96_ih227oh55kh11sh4dh0ph0_iw227ow55kw11sw4dw0pw0,1.21704 data = pd.read_csv(log, names=[ 'dnnl_verbose','timestamp','backend','exec','arch','type', 'jit', 'pass', 'fmt', 'opt', 'alg', 'shape', 'time', 'dummy'], engine='python') return data @@ -160,6 +157,15 @@ def load_log_mkldnn(self, log): print("load_log_mkldnn") data = pd.read_csv(log, names=[ 'mkldnn_verbose','exec','type', 'jit', 'pass', 'fmt', 'alg', 'shape', 'time'], engine='python') return data + def is_float(self, num): + if type(num) is not str: + return False + try: + float(num) + return True + except ValueError: + return False + class oneDNNUtils: diff --git a/Libraries/oneMKL/batched_linear_solver/lu_solve_omp_offload_optimized.F90 b/Libraries/oneMKL/batched_linear_solver/lu_solve_omp_offload_optimized.F90 index 7069bbd1e0..b12df7e76b 100644 --- a/Libraries/oneMKL/batched_linear_solver/lu_solve_omp_offload_optimized.F90 +++ b/Libraries/oneMKL/batched_linear_solver/lu_solve_omp_offload_optimized.F90 @@ -97,9 +97,7 @@ program solve_batched_linear_systems ! Allocate memory for linear algebra computations allocate (a(stride_a, batch_size), b(n, batch_size*nrhs), & -#if !defined(_OPENMP) ipiv(stride_ipiv, batch_size), & -#endif info_rf(batch_size), info_rs(batch_size), & stat = allocstat, errmsg = allocmsg) if (allocstat > 0) stop trim(allocmsg) @@ -188,9 +186,5 @@ program solve_batched_linear_systems print *, 'Total time:', total_time, 'seconds' ! 
Clean up -#if defined(_OPENMP) - deallocate (a, b, a_orig, b_orig, x, info_rf, info_rs) -#else deallocate (a, b, a_orig, b_orig, x, ipiv, info_rf, info_rs) -#endif end program solve_batched_linear_systems diff --git a/Libraries/oneMKL/binomial/makefile b/Libraries/oneMKL/binomial/makefile index 9af16e2fa8..9d21ae43d5 100644 --- a/Libraries/oneMKL/binomial/makefile +++ b/Libraries/oneMKL/binomial/makefile @@ -4,7 +4,7 @@ all: binomial_sycl.exe INIT_ON_HOST=/DINIT_ON_HOST=1 !endif -DPCPP_OPTS=-O3 /I$(MKLROOT)\include /DMKL_ILP64 /DVERBOSE=1 /DSMALL_OPT_N=0 /DREPORT_COLD=1 /DREPORT_WARM=1 $(INIT_ON_HOST) -fsycl -qmkl +DPCPP_OPTS=-O3 /I"$(MKLROOT)\include" /DMKL_ILP64 /DVERBOSE=1 /DSMALL_OPT_N=0 /DREPORT_COLD=1 /DREPORT_WARM=1 $(INIT_ON_HOST) -fsycl -qmkl -lOpenCL binomial_sycl.exe: src\binomial_sycl.cpp src\binomial_main.cpp src\binomial.hpp icpx $(DPCPP_OPTS) /DVERBOSE=1 /DSMALL_OPT_N=0 /DREPORT_COLD=1 /DREPORT_WARM=1 src\binomial_sycl.cpp src\binomial_main.cpp /obinomial_sycl.exe diff --git a/Libraries/oneMKL/binomial/src/binomial.hpp b/Libraries/oneMKL/binomial/src/binomial.hpp index 1238b368cb..073ea1be42 100644 --- a/Libraries/oneMKL/binomial/src/binomial.hpp +++ b/Libraries/oneMKL/binomial/src/binomial.hpp @@ -9,10 +9,6 @@ #include -#ifndef DATA_TYPE -#define DATA_TYPE double -#endif - #ifndef VERBOSE #define VERBOSE 0 #endif @@ -45,6 +41,7 @@ constexpr int opt_n = #define __VERSION__ __clang_major__ #endif +template <typename DATA_TYPE> class Binomial { public: Binomial(); @@ -73,4 +70,6 @@ class timer { std::chrono::steady_clock::time_point t1_, t2_; }; +bool is_fp64(); + #endif // __Binomial_HPP__ diff --git a/Libraries/oneMKL/binomial/src/binomial_main.cpp b/Libraries/oneMKL/binomial/src/binomial_main.cpp index a42b48dd43..c1c748427a 100644 --- a/Libraries/oneMKL/binomial/src/binomial_main.cpp +++ b/Libraries/oneMKL/binomial/src/binomial_main.cpp @@ -7,6 +7,7 @@ #include #include #include +#include <iostream> #include "binomial.hpp" @@ -34,7 +35,8 @@ void BlackScholesRefImpl(double& 
callResult, callResult = (S * N_d1 - L * std::exp(-r * t) * N_d2); } -void Binomial::check() { +template <typename DATA_TYPE> +void Binomial<DATA_TYPE>::check() { if (VERBOSE) { std::printf("Creating the reference result...\n"); std::vector h_call_result_host(opt_n); @@ -64,8 +66,16 @@ void Binomial::check() { } int main(int argc, char** argv) { - Binomial test; - test.run(); - test.check(); + if(is_fp64()){ + Binomial<double> test; + test.run(); + test.check(); + } + else{ + std::cout<<"Warning: could not find a device with double precision support. Single precision is used."<<std::endl; + Binomial<float> test; + test.run(); + test.check(); + } return 0; } diff --git a/Libraries/oneMKL/binomial/src/binomial_sycl.cpp b/Libraries/oneMKL/binomial/src/binomial_sycl.cpp index 063d4022f8..c82dd65586 100644 --- a/Libraries/oneMKL/binomial/src/binomial_sycl.cpp +++ b/Libraries/oneMKL/binomial/src/binomial_sycl.cpp @@ -14,7 +14,8 @@ constexpr int wg_size = 128; sycl::queue* binomial_queue; -Binomial::Binomial() { +template <typename DATA_TYPE> +Binomial<DATA_TYPE>::Binomial() { binomial_queue = new sycl::queue; h_call_result = sycl::malloc_shared<DATA_TYPE>(opt_n, *binomial_queue); @@ -45,7 +46,8 @@ Binomial::Binomial() { sycl::event::wait({event_1, event_2, event_3}); } -Binomial::~Binomial() { +template <typename DATA_TYPE> +Binomial<DATA_TYPE>::~Binomial() { sycl::free(h_call_result, *binomial_queue); sycl::free(h_stock_price, *binomial_queue); sycl::free(h_option_strike, *binomial_queue); @@ -54,7 +56,8 @@ Binomial::~Binomial() { delete binomial_queue; } -void Binomial::body() { +template <typename DATA_TYPE> +void Binomial<DATA_TYPE>::body() { constexpr int block_size = num_steps / wg_size; static_assert(block_size * wg_size == num_steps); @@ -139,7 +142,8 @@ void Binomial::body() { binomial_queue->wait(); } -void Binomial::run() { +template <typename DATA_TYPE> +void Binomial<DATA_TYPE>::run() { std::printf( "%s Precision Binomial Option Pricing version %d.%d running on %s using " "DPC++, workgroup size %d, sub-group size %d.\n", @@ -179,3 +183,17 @@ void Binomial::run() { std::printf("Time Elapsed = %10.5f seconds\n", t.duration()); fflush(stdout); } + +bool is_fp64() { 
+ sycl::queue test_queue; + return test_queue.get_device().has(sycl::aspect::fp64); +} + +template DLL_EXPORT Binomial<double>::Binomial(); +template DLL_EXPORT Binomial<float>::Binomial(); + +template DLL_EXPORT Binomial<double>::~Binomial(); +template DLL_EXPORT Binomial<float>::~Binomial(); + +template DLL_EXPORT void Binomial<double>::run(); +template DLL_EXPORT void Binomial<float>::run(); diff --git a/Libraries/oneMKL/black_scholes/makefile b/Libraries/oneMKL/black_scholes/makefile index 869f5652d8..b402788072 100644 --- a/Libraries/oneMKL/black_scholes/makefile +++ b/Libraries/oneMKL/black_scholes/makefile @@ -4,7 +4,7 @@ all: black_scholes_sycl.exe INIT_ON_HOST=/DINIT_ON_HOST=1 !endif -DPCPP_OPTS=-O3 /I$(MKLROOT)\include /DMKL_ILP64 /DVERBOSE=1 /DSMALL_OPT_N=0 $(INIT_ON_HOST) -fsycl -qmkl +DPCPP_OPTS=-O3 /I"$(MKLROOT)\include" /DMKL_ILP64 /DVERBOSE=1 /DSMALL_OPT_N=0 $(INIT_ON_HOST) -fsycl -qmkl -lOpenCL black_scholes_sycl.exe: src\black_scholes_sycl.cpp icpx $(DPCPP_OPTS) src\black_scholes_sycl.cpp /oblack_scholes_sycl.exe diff --git a/Libraries/oneMKL/black_scholes/src/black_scholes.hpp b/Libraries/oneMKL/black_scholes/src/black_scholes.hpp index e56f3db96b..9069403f1a 100644 --- a/Libraries/oneMKL/black_scholes/src/black_scholes.hpp +++ b/Libraries/oneMKL/black_scholes/src/black_scholes.hpp @@ -15,9 +15,9 @@ #define MINOR 6 /******* VERSION *******/ -#ifndef DATA_TYPE -#define DATA_TYPE double -#endif +// #ifndef DATA_TYPE +// #define DATA_TYPE double +// #endif #ifndef VERBOSE #define VERBOSE 1 @@ -47,6 +47,7 @@ constexpr size_t opt_n = #define __VERSION__ __clang_major__ #endif +template <typename DATA_TYPE> class BlackScholes { public: BlackScholes(); @@ -80,7 +81,8 @@ void BlackScholesRefImpl( call_result = (S * N_d1 - L * std::exp(-r * t) * N_d2); } -void BlackScholes::check() +template <typename DATA_TYPE> +void BlackScholes<DATA_TYPE>::check() { if (VERBOSE) { std::printf("Creating the reference result...\n"); diff --git a/Libraries/oneMKL/black_scholes/src/black_scholes_sycl.cpp b/Libraries/oneMKL/black_scholes/src/black_scholes_sycl.cpp index 
b510f99287..c19b9c6f16 100644 --- a/Libraries/oneMKL/black_scholes/src/black_scholes_sycl.cpp +++ b/Libraries/oneMKL/black_scholes/src/black_scholes_sycl.cpp @@ -62,7 +62,8 @@ static inline T CNDF_C(T input) } #endif // USE_CNDF_C -void BlackScholes::body() { +template <typename DATA_TYPE> +void BlackScholes<DATA_TYPE>::body() { // this can not be captured to the kernel. So, we need to copy internals of the class to local variables DATA_TYPE* h_stock_price_local = this->h_stock_price; DATA_TYPE* h_option_years_local = this->h_option_years; @@ -100,7 +101,8 @@ void BlackScholes::body() { }); } -BlackScholes::BlackScholes() +template <typename DATA_TYPE> +BlackScholes<DATA_TYPE>::BlackScholes() { black_scholes_queue = new sycl::queue; @@ -110,9 +112,6 @@ BlackScholes::BlackScholes() h_option_strike = sycl::malloc_shared<DATA_TYPE>(opt_n, *black_scholes_queue); h_option_years = sycl::malloc_shared<DATA_TYPE>(opt_n, *black_scholes_queue); - black_scholes_queue->fill(h_call_result, 0.0, opt_n); - black_scholes_queue->fill(h_put_result, 0.0, opt_n); - constexpr int rand_seed = 777; namespace mkl_rng = oneapi::mkl::rng; // create random number generator object @@ -130,7 +129,8 @@ BlackScholes::BlackScholes() sycl::event::wait({event_1, event_2, event_3}); } -BlackScholes::~BlackScholes() +template <typename DATA_TYPE> +BlackScholes<DATA_TYPE>::~BlackScholes() { sycl::free(h_call_result, *black_scholes_queue); sycl::free(h_put_result, *black_scholes_queue); @@ -140,7 +140,8 @@ BlackScholes::~BlackScholes() delete black_scholes_queue; } -void BlackScholes::run() +template <typename DATA_TYPE> +void BlackScholes<DATA_TYPE>::run() { std::printf("%s Precision Black&Scholes Option Pricing version %d.%d running on %s using DPC++, workgroup size %d, sub-group size %d.\n", sizeof(DATA_TYPE) > 4 ? 
"Double" : "Single", MAJOR, MINOR, black_scholes_queue->get_device().get_info().c_str(), wg_size, sg_size); @@ -171,9 +172,21 @@ void BlackScholes::run() int main(int const argc, char const* argv[]) { - BlackScholes test{}; - test.run(); - test.check(); + bool is_fp64 = true; + { + sycl::queue test_queue; + is_fp64 = test_queue.get_device().has(sycl::aspect::fp64); + } + if (is_fp64) { + BlackScholes test{}; + test.run(); + test.check(); + } else { + std::cout<<"Warning: could not find a device with double precision support. Single precision is used."< test{}; + test.run(); + test.check(); + } return 0; } diff --git a/Libraries/oneMKL/monte_carlo_european_opt/makefile b/Libraries/oneMKL/monte_carlo_european_opt/makefile index bfa108d332..f4ba4bc147 100644 --- a/Libraries/oneMKL/monte_carlo_european_opt/makefile +++ b/Libraries/oneMKL/monte_carlo_european_opt/makefile @@ -12,8 +12,8 @@ all: montecarlo INIT_ON_HOST=/DINIT_ON_HOST=1 !endif -DPCPP_OPTS=/I"$(MKLROOT)\include" /DMKL_ILP64 $(GENERATOR) -fsycl $(INIT_ON_HOST) -qmkl -montecarlo_main.cpp +DPCPP_OPTS=/I"$(MKLROOT)\include" /DMKL_ILP64 $(GENERATOR) -fsycl $(INIT_ON_HOST) -qmkl -lOpenCL +montecarlo: src/montecarlo_main.cpp icpx src/montecarlo_main.cpp /omontecarlo.exe $(DPCPP_OPTS) clean: diff --git a/Libraries/oneMKL/monte_carlo_european_opt/src/montecarlo.hpp b/Libraries/oneMKL/monte_carlo_european_opt/src/montecarlo.hpp index d86e22daa6..e91a2293ec 100644 --- a/Libraries/oneMKL/monte_carlo_european_opt/src/montecarlo.hpp +++ b/Libraries/oneMKL/monte_carlo_european_opt/src/montecarlo.hpp @@ -13,10 +13,6 @@ #include #include -#ifndef DATA_TYPE -#define DATA_TYPE double -#endif - #ifndef ITEMS_PER_WORK_ITEM #define ITEMS_PER_WORK_ITEM 4 #endif @@ -25,8 +21,6 @@ #define VEC_SIZE 8 #endif -using DataType = DATA_TYPE; - //Should be > 1 constexpr int num_options = 384000; //Should be > 16 @@ -34,17 +28,18 @@ constexpr int path_length = 262144; //Test iterations constexpr int num_iterations = 5; -constexpr DataType 
risk_free = 0.06f; -constexpr DataType volatility = 0.10f; +constexpr float risk_free = 0.06f; +constexpr float volatility = 0.10f; -constexpr DataType RLog2E = -risk_free * M_LOG2E; -constexpr DataType MuLog2E = M_LOG2E * (risk_free - 0.5 * volatility * volatility); -constexpr DataType VLog2E = M_LOG2E * volatility; +constexpr float RLog2E = -risk_free * M_LOG2E; +constexpr float MuLog2E = M_LOG2E * (risk_free - 0.5 * volatility * volatility); +constexpr float VLog2E = M_LOG2E * volatility; template <typename MonteCarlo_vector> void check(const MonteCarlo_vector& h_CallResult, const MonteCarlo_vector& h_CallConfidence, const MonteCarlo_vector& h_StockPrice, const MonteCarlo_vector& h_OptionStrike, const MonteCarlo_vector& h_OptionYears) { + using DataType = typename MonteCarlo_vector::value_type; std::vector<DataType> h_CallResultRef(num_options); auto BlackScholesRefImpl = []( diff --git a/Libraries/oneMKL/monte_carlo_european_opt/src/montecarlo_main.cpp b/Libraries/oneMKL/monte_carlo_european_opt/src/montecarlo_main.cpp index bd93ce430c..fb9b0e047c 100644 --- a/Libraries/oneMKL/monte_carlo_european_opt/src/montecarlo_main.cpp +++ b/Libraries/oneMKL/monte_carlo_european_opt/src/montecarlo_main.cpp @@ -18,8 +18,11 @@ template class k_MonteCarlo; // can be useful for profiling +template <typename DataType> +class k_initialize_state; // can be useful for profiling -int main(int argc, char** argv) +template <typename DataType> +void run() { try { std::cout << "MonteCarlo European Option Pricing in " << @@ -97,7 +100,7 @@ auto rng_states_uptr = std::unique_ptr(sycl::malloc_device(n_states, my_queue), deleter); auto* rng_states = rng_states_uptr.get(); - my_queue.parallel_for( + my_queue.parallel_for<k_initialize_state<DataType>>( sycl::range<1>(n_states), std::vector<sycl::event>{rng_event_1, rng_event_2, rng_event_3}, [=](sycl::item<1> idx) { @@ -179,5 +182,18 @@ int main(int argc, char** argv) std::cout << e.what(); exit(1); } - return 0; +} + +int main(int argc, char** argv){ + bool is_fp64 = true; + { + sycl::queue test_queue; + is_fp64 = 
test_queue.get_device().has(sycl::aspect::fp64); + } + if (is_fp64) { + run<double>(); + } else { + std::cout<<"Warning: could not find a device with double precision support. Single precision is used."<<std::endl; + run<float>(); + } } diff --git a/README.md b/README.md index 2ac3972654..2e7a779bbc 100644 --- a/README.md +++ b/README.md @@ -34,7 +34,7 @@ Clone an earlier version of the repository using Git by entering a command simil `git clone -b <tag> https://github.com/oneapi-src/oneAPI-samples.git` -where `<tag>` is the GitHub tag corresponding to the toolkit version number, like **2024.0.0**. +where `<tag>` is the GitHub tag corresponding to the toolkit version number, like **2024.1.0**. Alternatively, you can download a zip file containing a specific tagged version of the repository. @@ -82,61 +82,60 @@ The oneAPI-sample repository is organized by high-level categories. ## Platform Validation -Samples in this release are validated on the following platforms. - -### Ubuntu 22.04 -Intel(R) Xeon(R) Platinum 8352Y CPU @ 2.20GHz -Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 3.0, (pvc) -Opencl driver: Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.10.0.17_160000] -Level Zero driver: Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26918] -oneAPI package version: -\- Intel oneAPI Base Toolkit Build Version: 2024.0.0.49556 -\- Intel oneAPI HPC Toolkit Build Version: 2024.0.0.49582 -\- Intel oneAPI Rendering Toolkit Build Version: 2024.0.0.49646 -\- Intel AI Tools 2024.0.0.48873 - -12th Gen Intel(R) Core(TM) i9-12900 -Intel(R) UHD Graphics 770 3.0 ; (gen12, AlderLake-S GT1 [8086:4680]) -Opencl driver: Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.10.0.17_160000] -Level Zero driver: Intel(R) Level-Zero, Intel(R) UHD Graphics 750 1.3 [1.3.26918] -oneAPI package version: -\- Intel oneAPI Base Toolkit Build Version: 2024.0.0.49556 -\- Intel oneAPI HPC Toolkit Build Version: 2024.0.0.49582 -\- 
Intel oneAPI Rendering Toolkit Build Version: 2024.0.0.49646 -\- Intel AI Tools 2024.0.0.48873 - -11th Gen Intel(R) Core(TM) i7-11700 -Intel(R) UHD Graphics 750 3.0, (gen12, RocketLake) -Opencl driver: Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.10.0.17_160000] -Level Zero driver: Intel(R) Level-Zero, Intel(R) UHD Graphics 750 1.3 [1.3.26918] -oneAPI package version: -\- Intel oneAPI Base Toolkit Build Version: 2024.0.0.49556 -\- Intel oneAPI HPC Toolkit Build Version: 2024.0.0.49582 -\- Intel oneAPI Rendering Toolkit Build Version: 2024.0.0.49646 -\- Intel AI Tools 2024.0.0.48873 +Samples in this release are validated on the following platforms. + +### Ubuntu 22.04 +Intel(R) Xeon(R) Platinum 8352Y CPU @ 2.20GHz \ +Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 (pvc) \ +Opencl driver: Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.10.0.17_160000] \ +Level Zero driver: Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.28202] \ +oneAPI package version: \ +‐ Intel oneAPI Base Toolkit Build Version: 2024.1.0.596 \ +‐ Intel oneAPI HPC Toolkit Build Version: 2024.1.0.560 \ +‐ Intel oneAPI Rendering Toolkit Build Version: 2024.1.0.743 \ +‐ Intel AI Tools 2024.1.0.84 + +12th Gen Intel(R) Core(TM) i9-12900 \ +Intel(R) UHD Graphics 770 3.0 ; (gen12, AlderLake-S GT1 [8086:4680]) \ +Opencl driver: Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.17.3.0.08_160000] \ +Level Zero driver: Intel(R) Level-Zero, Intel(R) UHD Graphics 770 1.3 [1.3.28202] \ +oneAPI package version: \ +‐ Intel oneAPI Base Toolkit Build Version: 2024.1.0.596 \ +‐ Intel oneAPI HPC Toolkit Build Version: 2024.1.0.560 \ +‐ Intel oneAPI Rendering Toolkit Build Version: 2024.1.0.743 \ +‐ Intel AI Tools 2024.1.0.84 + +11th Gen Intel(R) Core(TM) i7-11700 \ +Intel(R) UHD Graphics 750 3.0, (gen12, RocketLake) \ +Opencl driver: Intel(R) 
FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.17.3.0.08_160000] \ +Level Zero driver: Intel(R) Level-Zero, Intel(R) UHD Graphics 750 1.3 [1.3.28202] \ +oneAPI package version: \ +‐ Intel oneAPI Base Toolkit Build Version: 2024.1.0.596 \ +‐ Intel oneAPI HPC Toolkit Build Version: 2024.1.0.560 \ +‐ Intel oneAPI Rendering Toolkit Build Version: 2024.1.0.743 \ +‐ Intel AI Tools 2024.1.0.84 ### Windows 11 -12th Gen Intel(R) Core(TM) i9-12900 -Intel(R) UHD Graphics 770 3.0 ; (gen12, AlderLake-S GT1 [8086:4680]) -Opencl driver: Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.10.0.17_160000] -Level Zero driver: Intel(R) Level-Zero, Intel(R) UHD Graphics 750 1.3 [1.3.27359] -oneAPI package version: -\- Intel oneAPI Base Toolkit Build Version: 2024.0.0.49557 -\- Intel oneAPI HPC Toolkit Build Version: 2024.0.0.49577 -\- Intel oneAPI Rendering Toolkit Build Version: 2024.0.0.49649 - -11th Gen Intel(R) Core(TM) i7-11700 -Intel(R) UHD Graphics 750 3.0, (gen12, RocketLake) -Opencl driver: Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.10.0.17_160000] -Level Zero driver: Intel(R) Level-Zero, Intel(R) UHD Graphics 750 1.3 [1.3.27359] -oneAPI package version: -\- Intel oneAPI Base Toolkit Build Version: 2024.0.0.49557 -\- Intel oneAPI HPC Toolkit Build Version: 2024.0.0.49577 -\- Intel oneAPI Rendering Toolkit Build Version: 2024.0.0.49649 +12th Gen Intel(R) Core(TM) i9-12900 Intel(R) UHD Graphics 770 3.0 ; (gen12, AlderLake-S GT1 [8086:4680]) \ +Opencl driver: Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.17.3.0.08_160000] \ +Level Zero driver: Intel(R) Level-Zero, Intel(R) UHD Graphics 770 1.3 [1.3.28597] \ +oneAPI package version: \ +‐ Intel oneAPI Base Toolkit Build Version: 2024.1.0.595 \ +‐ Intel oneAPI HPC Toolkit Build Version: 2024.1.0.561 \ +‐ Intel oneAPI Rendering Toolkit Build Version: 
2024.1.0.745 + +11th Gen Intel(R) Core(TM) i7-11700 +Intel(R) UHD Graphics 750 3.0, (gen12, RocketLake) \ +Opencl driver: Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.17.3.0.08_160000] \ +Level Zero driver: Intel(R) Level-Zero, Intel(R) UHD Graphics 750 1.3 [1.3.28597] \ +oneAPI package version: \ +‐ Intel oneAPI Base Toolkit Build Version: 2024.1.0.595 \ +‐ Intel oneAPI HPC Toolkit Build Version: 2024.1.0.561 \ +‐ Intel oneAPI Rendering Toolkit Build Version: 2024.1.0.745 ### macOS -Intel(R) Core(TM) i7-8700B CPU @ 3.20GHz -\- Intel oneAPI Rendering Toolkit Build Version: 2024.0.0.49648 +Intel(R) Core(TM) i7-8700B CPU @ 3.20GHz \ +‐ Intel oneAPI Rendering Toolkit Build Version: 2024.1.0.744 ## Known Issues and Limitations diff --git a/RenderingToolkit/IRTK_Learning_Path/0_Introduction_to_Jupyter/Introduction_to_Jupyter.ipynb b/RenderingToolkit/IRTK_Learning_Path/0_Introduction_to_Jupyter/Introduction_to_Jupyter.ipynb new file mode 100644 index 0000000000..e076fee637 --- /dev/null +++ b/RenderingToolkit/IRTK_Learning_Path/0_Introduction_to_Jupyter/Introduction_to_Jupyter.ipynb @@ -0,0 +1,144 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Introduction to JupyterLab* and Notebooks" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you are familiar with Jupyter* you can skip to the first exercise.\n", + "\n", + "